Bjoern

Reputation: 445

Dataproc cluster runs a maximum of 5 jobs in parallel, ignoring available resources

I am loading data from 1200 MS SQL Server tables into BigQuery with a Spark job. It is part of an orchestrated ETL process in which the Spark job consists of Scala code that receives messages from Pub/Sub. Around 1200 messages arrive over a period of about an hour, and each message (asynchronously) triggers code that reads data from one table, applies minor transformations, and writes to BigQuery. The process itself works fine. My problem is that the number of active jobs in Spark never goes above 5, despite a lot of "jobs" waiting and plenty of resources being available.
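One thing worth ruling out on the driver side (an assumption, since the message-handling code isn't shown in the question): if each Pub/Sub message is handled as a `Future` on a bounded thread pool, the pool size, not the cluster's resources, caps how many table loads run concurrently. A minimal sketch without Spark, with a hypothetical pool size of 5:

```scala
// Minimal sketch (no Spark; pool size of 5 is hypothetical): a bounded
// thread pool caps concurrency regardless of how many tasks are queued.
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

val pool = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(5))
implicit val ec: ExecutionContext = pool

val active  = new AtomicInteger(0)   // loads running right now
val maxSeen = new AtomicInteger(0)   // highest concurrency observed

// 50 simulated "load one table" tasks, all submitted at once.
val loads = (1 to 50).map { _ =>
  Future {
    val now = active.incrementAndGet()
    maxSeen.getAndUpdate(m => math.max(m, now))
    Thread.sleep(20) // stand-in for read-from-SQL-Server + write-to-BigQuery
    active.decrementAndGet()
  }
}

Await.result(Future.sequence(loads), 1.minute)
pool.shutdown()
println(s"max concurrent loads observed: ${maxSeen.get}")
```

Note that `scala.concurrent.ExecutionContext.global` is bounded the same way: its default parallelism is the machine's core count, and in client mode that machine is the master node, so a small master can translate into a small cap on concurrent jobs.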

I've tried upping spark.driver.cores to 30, but there is no change. Also, while this setting is visible in the Google Cloud Console, it doesn't seem to make it through to the actual Spark job (when viewed in the Spark UI). Here is the Spark job running in the console:

[Screenshot: the running Spark job in the Google Cloud Console]

And here are the Spark job properties:

[Screenshot: the Spark job's properties]

It's a pretty big cluster, with plenty of resources to spare:

[Screenshot: cluster monitoring showing plenty of spare YARN resources]

Here is the command line for creating the cluster:

gcloud dataproc clusters create odsengine-cluster \
--properties dataproc:dataproc.conscrypt.provider.enable=false,spark:spark.executor.userClassPathFirst=true,spark:spark.driver.userClassPathFirst=true \
--project=xxx \
--region europe-north1 \
--zone europe-north1-a \
--subnet xxx \
--master-machine-type n1-standard-4 \
--worker-machine-type m1-ultramem-40 \
--master-boot-disk-size 30GB \
--worker-boot-disk-size 2000GB \
--image-version 1.4 \
--master-boot-disk-type=pd-ssd \
--worker-boot-disk-type=pd-ssd \
--num-workers=2 \
--scopes cloud-platform \
--initialization-actions gs://xxx/cluster_init/init_actions.sh

And the command line for submitting the spark job:

gcloud dataproc jobs submit spark \
--project=velliv-dwh-development \
--cluster odsengine-cluster \
--region europe-north1 \
--jars gs://velliv-dwh-dev-bu-dcaods/OdsEngine_2.11-0.1.jar \
--class Main \
--properties \
spark.executor.memory=35g,\
spark.executor.cores=2,\
spark.executor.memoryOverhead=2g,\
spark.dynamicAllocation.enabled=true,\
spark.shuffle.service.enabled=true,\
spark.driver.cores=30\
-- yarn

I am aware that I could use partitioning to spread out the load of large individual tables, and I've had that working successfully in another scenario, but in this case I just want to load many tables at once without partitioning each one.

Upvotes: 2

Views: 1425

Answers (1)

Dagang Wei

Reputation: 26548

Regarding "a lot of jobs waiting and plenty of resources being available", I'd suggest checking the Spark logs and the YARN web UI and logs to see whether there are pending applications and, if so, why. It also helps to check the YARN resource utilization graphs on the cluster web UI's monitoring tab.
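For the first check, pending work can also be inspected from the master node with the standard YARN CLI, for example:

```shell
# Applications stuck in ACCEPTED are waiting for YARN to grant them resources.
yarn application -list -appStates ACCEPTED

# Per-node resource usage (memory and vcores used vs. available).
yarn node -list -all
```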

Regarding the spark.driver.cores problem, it is effective only in cluster mode; see the Spark configuration docs:

Number of cores to use for the driver process, only in cluster mode

On Dataproc the Spark driver runs in client mode by default, which means the driver runs on the master node outside of YARN. You can run the driver in cluster mode, as a YARN container, by setting the property spark.submit.deployMode=cluster.
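For example, reusing the submit command from the question (the driver memory value here is only illustrative), cluster mode plus the driver settings would look like:

```shell
gcloud dataproc jobs submit spark \
  --cluster odsengine-cluster \
  --region europe-north1 \
  --jars gs://velliv-dwh-dev-bu-dcaods/OdsEngine_2.11-0.1.jar \
  --class Main \
  --properties \
spark.submit.deployMode=cluster,\
spark.driver.cores=30,\
spark.driver.memory=16g
```

In cluster mode the driver runs as a YARN container on a worker node, so spark.driver.cores then takes effect, and the requested cores and memory must fit on a worker.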

Upvotes: 2
