anders

Reputation: 21

unable to run a datalab job on dataproc

I have set up datalab to run on a dataproc master node using the datalab initialisation action:

gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://<GCS_BUCKET>/datalab/datalab.sh \
--scopes cloud-platform

This has historically worked OK. However, as of 30.5 I can no longer get any code to run, however simple. I just get the "Running" progress bar, with no timeouts and no error messages. How can I debug this?

Upvotes: 2

Views: 262

Answers (1)

Patrick Clay

Reputation: 1349

I just created a cluster and it seemed to work for me.

Just seeing "Running" usually means that there is not enough room in the cluster to schedule a Spark Application. Datalab loads PySpark when Python loads and that creates a YARN application. Any code will block until the YARN application is scheduled.

On the default cluster of 2 n1-standard-4 workers, with the default configs, there can only be 1 Spark application. You should be able to fit two notebooks by setting --properties spark.yarn.am.memory=1g or by using a larger cluster, but you will still eventually hit a limit on the number of running notebooks per cluster.
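For example, the property can be set at cluster-creation time. This is a sketch based on the question's own command; note that on Dataproc, Spark properties passed via --properties take a spark: file prefix:

gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://<GCS_BUCKET>/datalab/datalab.sh \
--scopes cloud-platform \
--properties spark:spark.yarn.am.memory=1g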

Upvotes: 3
