Dataproc is a fully managed and highly scalable service for running Apache Spark, Flink, Presto, and 30+ open source tools and frameworks, which is useful for data lake modernization, ETL, and secure data science.

So first I open Cloud Shell to get the file used in this video. I use the gsutil cp command and specify the path to the file. The file is called benchmark.py, and it is downloaded to the current directory. To see the contents of the file I could use tail, but actually I don't need tail, just the cat command. Here we go, this is the file. This is basically the algorithm to calculate the value of pi, and you can see the unit square is defined here. Basically, the script takes one argument, which is the number of partitions.

So I'll go to the Dataproc console, but before that I'll go to Cloud Storage and make a Cloud Storage bucket. The name will be the ID of this project, and the region doesn't matter. I just provision this bucket here. Then I go to the Dataproc console and create a cluster.

So I enter the name of the cluster and specify the region. You can also change the version of your image, so I'll use Debian, Hadoop 3.2, Spark 3.1, and then configure the nodes. For the master node I will use the machine type n1-standard-2. For the worker nodes, I will configure two workers. You can specify the machine type here as well, and again I use the same machine type for the worker nodes as for the master node. Here you can specify the Cloud Storage staging bucket. This is used for storing the cluster's driver output and some other outputs. So I run gsutil ls again to check the bucket I just created. This is my project name and the bucket that I created, so I'm going to paste it here and cut off the unnecessary part. Then I leave the other configuration as default and create the cluster. It will take some time to be provisioned, so meanwhile I'll clear this terminal.

Basically, the cluster is the part of Dataproc that maintains your hardware. For example, if you open the cluster details, there are VM instances, and these are Compute Engine instances. This is the master node, and I provisioned two worker nodes, whose role is defined as worker. If you go to the Compute Engine console, you can check them. This is a node configured with the machine type n1-standard-2, which I specified.
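Going back to the benchmark script shown at the start: the file itself is not reproduced in this transcript, so here is only a minimal sketch of the same idea, a Monte Carlo estimate of pi over the unit square, split into partitions. It is plain Python rather than PySpark, each loop iteration stands in for one Spark task, and the names `count_hits`, `estimate_pi`, and the default of 10,000 samples per partition are assumptions made for illustration, not taken from benchmark.py.

```python
import random


def count_hits(num_samples, seed):
    """Count random points of the unit square that land inside the quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(partitions, samples_per_partition=10_000):
    """Estimate pi; more partitions means more samples and more work.

    samples_per_partition is a hypothetical value chosen for this sketch.
    """
    total_samples = partitions * samples_per_partition
    # Each partition is processed independently, as a Spark task would be.
    hits = sum(count_hits(samples_per_partition, seed=p) for p in range(partitions))
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * hits / total_samples


if __name__ == "__main__":
    print(estimate_pi(20))
```

Because the total number of samples grows with the partition count, a larger argument should give a longer-running job, which is the effect the jobs in this video rely on.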
You can also check the worker nodes, which are being provisioned. So these are just provisioned, and when I go back to the cluster, the status is still Provisioning. You can check the other configuration too. This is the primary disk size, 500 GB, but you can also have a larger disk size, and you can configure the details as you like. You can also check the web interfaces and the job information, which is not created yet. And of course you can check the monitoring to see your YARN memory, HDFS information, disk bytes, and so on.

Okay, so while the cluster is provisioning, I go to the Jobs page to submit a job. Let's see if I can submit one. Yes, it looks like I can. The job type is PySpark, and the main Python file is the benchmark file that I downloaded here, but I'm going to use this file from the staging bucket. So again I check the staging bucket, which is named after the project I'm currently using. I use the gsutil cp command again and upload the benchmark file from local to my Cloud Storage bucket. Here you go, it is uploaded, and this is my gs:// path. So here I specify the gs:// path and then benchmark.py, and this will fetch the benchmark.py file from the bucket I just created. Then there is the Arguments field.
As I mentioned, I can specify an argument. I will use 20 first. Then for max restarts per hour, I will set it to just one. So I submit the job, and it's successfully submitted. The status is now Running, so the job is running on the cluster. If I go back to my cluster, yes, it is running, and if I look at the metrics, it is now slowly starting to use some resources. It will take some time until the job is completed; it's still running as PySpark. You can check the elapsed time, so how long the job takes to complete.

By the way, if you'd like to see other information, especially hardware details, you can go to the VM instance and check the information from the VM instance side, where there is observability. So I can check disk throughput, network packets, and that kind of information from this side as well.

This shows the output of your job and the configuration as well, and it actually succeeded. If you go back, you can check the running jobs, and here we go: the status is now Succeeded. So the first job has run successfully and the provisioning has succeeded.

So let's go to the next job. I will submit the next job, and it will be exactly the same. I will use the same file from the same bucket, benchmark.py, and this time the argument will be 220, so the number of partitions will be 220.
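The console steps above (PySpark job type, main file in the bucket, one argument, max restarts per hour) can also be expressed on the command line. The following is a hedged sketch that only builds and prints the command rather than running it. The bucket path `gs://my-project-id/benchmark.py` and the cluster name `demo-cluster` are placeholders, the helper name `submit_pyspark_command` is made up for this example, and `--max-failures-per-hour` is, to my understanding, the gcloud counterpart of the console's max restarts per hour field.

```python
import shlex


def submit_pyspark_command(script_uri, cluster, region, partitions,
                           max_failures_per_hour=1):
    """Build the argv list for submitting a PySpark job to a Dataproc cluster."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script_uri,
        f"--cluster={cluster}",
        f"--region={region}",
        f"--max-failures-per-hour={max_failures_per_hour}",
        # Everything after "--" is passed to the script as its own arguments.
        "--", str(partitions),
    ]


if __name__ == "__main__":
    # Placeholder bucket and cluster names; replace them with your own.
    for partitions in (20, 220):
        cmd = submit_pyspark_command(
            "gs://my-project-id/benchmark.py", "demo-cluster", "us-central1", partitions
        )
        print(shlex.join(cmd))
```

In an environment where gcloud is installed and authenticated, the returned list could be passed to `subprocess.run`; here it is only printed so the two submissions can be compared side by side.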
And again one for max restarts per hour. So basically, what I'm doing here is this: if I look at the benchmark.py file, the argument increases the number here, which is used to calculate pi. So it usually takes more time than the first job, which had 20 as the argument. Let's see how that is reflected in the job and in the metrics shown on the monitoring tab.

It will take some time, and obviously it takes more time than the first one, which was only 20. The elapsed time is 48 seconds already. If I refresh this page, yes, it's still running, and that is nearly one minute, so it might take some more time.

Meanwhile, I will prepare one thing here, so I open Cloud Shell again. What I'm going to do after this job finishes is update my cluster. As you can see from my cluster information, it has two worker nodes and one master node, which can be seen from the VM instances. So right here there is the master node and two worker nodes. I'm going to increase the number of worker nodes so that the cluster can work efficiently to handle a job that takes extra time, for example the second job. I just left it with the default name, but basically it is the second one. It takes more time, about twice as long as the first one.
So I will update this cluster and see if this affects the job execution time. So it's gcloud dataproc, then clusters update, then the name of my demo cluster, and the region, which is us-central1, then --num-workers, where I'll specify five. So let's start the operation to update the cluster. If I go to the cluster page here, you see the cluster is updating, and if I go to VM instances, you see that at first it had only two worker nodes, but now it has five.

So it's easy to configure your cluster using the command line, and when you have larger files, or a file that takes more processing, it's usually recommended to increase the size of the cluster. This is horizontal scaling, as opposed to vertical scaling, which would just change your machine type, because those additional workers have the same machine type as the previous workers. If I go to the page, you can check that n1-standard-2 is the machine type, and if I also go to the first two workers, the machine type is exactly the same, n1-standard-2.

So let's go back, and now the provisioning is finished. So I go to Jobs and submit a new job, let's say job3. This will have exactly the same configuration as the previous job. So I use gsutil again to check my Cloud Storage path; here we go.
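For reference, the resize command described above can be written out as follows. This is a sketch that only assembles and prints the command; `demo-cluster` is a placeholder for whatever the cluster is actually called, and the helper name `resize_cluster_command` is made up for this example. The region and worker count are the ones mentioned in the video.

```python
import shlex


def resize_cluster_command(cluster, region, num_workers):
    """Build the argv list for changing a Dataproc cluster's primary worker count."""
    if num_workers < 0:
        raise ValueError("num_workers must not be negative")
    return [
        "gcloud", "dataproc", "clusters", "update", cluster,
        f"--region={region}",
        # Horizontal scaling: more workers of the same machine type.
        f"--num-workers={num_workers}",
    ]


if __name__ == "__main__":
    # Placeholder cluster name; replace it with your own.
    cmd = resize_cluster_command("demo-cluster", "us-central1", 5)
    print(shlex.join(cmd))
```

Only the worker count changes here; the machine type is untouched, which is why the three added workers show the same n1-standard-2 type as the first two.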
I will use it to specify the path. This is benchmark.py, then the argument is exactly the same, 220, and the rest of the configuration here is exactly the same too. I submit the job to see if this reduces the total processing time for exactly the same job.

So I go back to the Jobs page, and now I can see it is still running. This one and this one have exactly the same configuration, and the first of them took 1 minute 47 seconds. Now I have added three worker nodes, to see if this helps to reduce the elapsed time, or the total processing time. Let's refresh the page again. It is still running.

For example, this one is the previous job. If you look at it, the CPU utilization is fifty percent, along with some other information, and it uses a lot of memory. If I look at the new job, it seems to be using less memory, and the CPU utilization is the same, but it is spread across more workers of the same type. And there you go, it succeeded.