supremed14

Reputation: 91

Choosing an appropriate cluster spec for Google Dataproc to handle our data

I am trying to process a fairly large dataset for a Kaggle competition.

The data is about 80 GB: 2 billion rows x 6 columns.

We put the data in Google Cloud Storage and tried to process it with Google Datalab, but the data is too big and we got an error message.

So we're now trying to use PySpark on Google Dataproc.

I have two questions about this:

1) Is this option enough?

2) Is Google Compute Engine needed to handle the Google Dataproc cluster systems? If so, which is suitable in this case?

Thank you for reading this and I will be waiting for your answers :)

Thanks!

Upvotes: 0

Views: 468

Answers (1)

marcyb5st

Reputation: 252

So, first of all I will address the Compute Engine vs Dataproc question, and then move on to sizing the cluster.

Compute Engine is Google's IaaS offering and it's basically a service to spin up VMs. Google Dataproc uses Google Compute Engine to spin up the virtual machines that act as the master and worker nodes in your cluster. Moreover, Dataproc already installs and configures several things on the nodes, so you don't have to take care of that. If you need more stuff on the nodes, Google maintains a set of scripts that can be used to install additional dependencies on the cluster. So, to answer your question: you need Google Compute Engine in the sense that without it you won't be able to spin up a cluster, but Dataproc creates the VMs for you. And if you're already set on using PySpark, Dataproc is the right choice.

Regarding the size, it really depends on what kind of analysis you are running and whether the data is evenly distributed. If you have a hot key/shard whose data is bigger than the memory of a single node, you need to increase the node size. If the computation is CPU intensive, then add cores. The good thing about Google Dataproc is that you can spin up a cluster in about 90 seconds and tear it down in around the same time. This should give you the possibility to experiment quite a bit!
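To make the sizing reasoning a bit more concrete, here is a rough back-of-envelope sketch in plain Python, using the 80 GB and 2 billion rows from the question. Everything else is an assumption for illustration: the 128 MiB target partition size, the 52 GB node memory, the 50% usable-memory fraction, and the helper names themselves are hypothetical. Note also that data usually takes more room in memory than on disk, so treat the results as a starting point for experiments, not a spec.

```python
import math
from collections import Counter

# Figures taken from the question.
DATASET_BYTES = 80 * 10**9   # about 80 GB
ROW_COUNT = 2 * 10**9        # 2 billion rows

# Assumption: aim for roughly 128 MiB of input per partition.
TARGET_PARTITION_BYTES = 128 * 1024**2


def bytes_per_row(dataset_bytes, row_count):
    """Average on-disk size of one row."""
    return dataset_bytes / row_count


def partitions_needed(dataset_bytes, target_partition_bytes):
    """Number of partitions so that each holds at most the target size."""
    return math.ceil(dataset_bytes / target_partition_bytes)


def hottest_key_share(sample_keys):
    """Return the most frequent key in a sample and its fraction of the sample."""
    counts = Counter(sample_keys)
    key, count = counts.most_common(1)[0]
    return key, count / len(sample_keys)


def hot_key_fits(dataset_bytes, share, node_memory_bytes, usable_fraction=0.5):
    """Check whether the hottest key's data fits in the usable part of one node.

    usable_fraction is a guess at how much node memory is really available
    to hold the data (hypothetical default, tune it for your cluster).
    """
    return dataset_bytes * share <= node_memory_bytes * usable_fraction


if __name__ == "__main__":
    print("bytes per row:", bytes_per_row(DATASET_BYTES, ROW_COUNT))
    print("partitions:", partitions_needed(DATASET_BYTES, TARGET_PARTITION_BYTES))

    # Toy sample of grouping keys; in practice, sample them from the real data.
    key, share = hottest_key_share(["a", "a", "b", "a"])
    node_memory = 52 * 10**9  # assumption: a node with about 52 GB of RAM
    print("hottest key:", key, "share:", share)
    print("fits on one node:", hot_key_fits(DATASET_BYTES, share, node_memory))
```

If the hottest key does not fit, that points to bigger nodes (or reworking the key); if it does fit and jobs are still slow, more cores or more workers is the thing to try next.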

Hope this helps!

Upvotes: 2
