Shalanki Gupta

Reputation: 131

Using an existing dataproc cluster to run dask

I have a Dataproc cluster running on the Google Cloud Platform. I intend to pass this cluster to the Dask client instead of initializing a new dask-yarn cluster.

However, I am not able to use my Dataproc cluster directly:

# Instead of:
from dask_yarn import YarnCluster
from dask.distributed import Client

cluster = YarnCluster(environment='environment.tar.gz', worker_vcores=2, worker_memory="8GiB")
cluster.scale(10)
client = Client(cluster)

# I want to use my existing Dataproc cluster directly, something like:
client = Client(my_dataproc_cluster)  # pseudocode: this is what I'd like to do

Upvotes: 2

Views: 1185

Answers (1)

jiminy_crist

Reputation: 2445

Dataproc creates a new Hadoop cluster; dask-yarn is for creating Dask clusters that run inside your Hadoop cluster (wherever that may be). To run properly it requires properly set up Python environments and configuration, just as any other tool on Hadoop would (Spark included).
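
In practice that means launching Dask from inside the Dataproc cluster (typically from the master node) with dask-yarn, rather than handing the cluster object to Client. A minimal sketch of that pattern, assuming dask-yarn and conda-pack are already installed on the master node; the environment name, worker sizes, and worker count below are placeholders:

# Run this on the Dataproc master node (or any node with YARN client configuration).
import conda_pack
from dask_yarn import YarnCluster
from dask.distributed import Client

# Package the local conda environment so YARN can ship it to the worker nodes.
conda_pack.pack(name="dask-env", output="environment.tar.gz")  # hypothetical env name

# dask-yarn asks the cluster's YARN resource manager for containers,
# so the Dask workers run inside the existing Dataproc (Hadoop) cluster.
cluster = YarnCluster(
    environment="environment.tar.gz",
    worker_vcores=2,
    worker_memory="8GiB",
)
cluster.scale(10)

# The Client still wraps the YarnCluster object; there is no way to pass
# the Dataproc cluster itself to Client().
client = Client(cluster)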

We don't have a Dataproc-specific guide, but the one for AWS's equivalent, EMR, is here: http://yarn.dask.org/en/latest/aws-emr.html

For deploying on Dataproc you'd likely create an initialization action equivalent to the EMR bootstrap action: https://github.com/dask/dask-yarn/blob/master/deployment_resources/aws-emr/bootstrap-dask
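
If it helps, here is a rough sketch of attaching such a script at cluster-creation time using the google-cloud-dataproc client library. This is not something dask-yarn provides; the project, region, bucket, and script path are all placeholders, and the script itself would need to do roughly what the EMR bootstrap-dask script does (install conda, dask, and dask-yarn on every node):

from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "my-dask-cluster",
    "config": {
        # Runs on every node at cluster creation, like an EMR bootstrap action.
        "initialization_actions": [
            {"executable_file": "gs://my-bucket/bootstrap-dask.sh"}  # placeholder script
        ],
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is created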

Upvotes: 3
