Fred Rouvier

Reputation: 61

Use GCS staging directory for Spark jobs (on Dataproc)

I'm trying to change the Spark staging directory to prevent data loss on worker decommissioning (on Google Dataproc with Spark 2.4).
I want to switch from HDFS staging to Google Cloud Storage staging.

When I run this command:

spark-submit --conf "spark.yarn.stagingDir=gs://my-bucket/my-staging/"  gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py

I get this error:

org.apache.spark.SparkException: Application application_1560413919313_0056 failed 2 times due to AM Container for appattempt_1560413919313_0056_000002 exited with exitCode: -1000

Failing this attempt.Diagnostics: [2019-06-20 07:58:04.462]File not found : gs:/my-staging/.sparkStaging/application_1560413919313_0056/pyspark.zip java.io.FileNotFoundException: File not found : gs:/my-staging/.sparkStaging/application_1560413919313_0056/pyspark.zip

The Spark job fails but the .sparkStaging/ directory is created on GCS.

Any idea on this issue?

Thanks.

Upvotes: 3

Views: 3104

Answers (1)

Ben Sidhom

Reputation: 1588

First, it's important to realize that the staging directory is used mainly for staging artifacts for executors (chiefly jars and other archives), not for storing intermediate data as a job executes. If you want to preserve intermediate job data (primarily shuffle data) after worker decommissioning (e.g., after machine preemption or scale-down), then Dataproc Enhanced Flexibility Mode (currently in alpha) may help you.
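If you do want to try Enhanced Flexibility Mode, it is enabled through cluster properties at cluster creation time. A minimal sketch follows; the property name dataproc:efm.spark.shuffle=primary-worker is taken from the Dataproc EFM documentation as I recall it and may differ in the alpha release, and my-efm-cluster and the region are placeholders, so verify against the current docs:

# Create a cluster that keeps Spark shuffle data on primary workers (EFM).
# Property name is an assumption from the EFM docs; cluster name and region are placeholders.
gcloud dataproc clusters create my-efm-cluster \
    --region=us-central1 \
    --properties=dataproc:efm.spark.shuffle=primary-worker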

Your command works for me on both Dataproc image versions 1.3 and 1.4. Make sure that your target staging bucket exists and that the Dataproc cluster (i.e., the service account that the cluster runs as) has read and write access to the bucket. Note that the GCS connector will not create buckets for you.
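As a quick sanity check, you can verify the bucket and the cluster service account's access from the command line. A rough sketch with placeholder names (my-bucket, my-cluster, us-central1, and SA_EMAIL are assumptions to substitute with your own values):

# Confirm the staging bucket exists; create it if not (the GCS connector won't create it for you).
gsutil ls -b gs://my-bucket/ || gsutil mb -l us-central1 gs://my-bucket/

# Look up the service account the cluster runs as
# (empty output means the Compute Engine default service account is used).
gcloud dataproc clusters describe my-cluster --region=us-central1 \
    --format='value(config.gceClusterConfig.serviceAccount)'

# Grant that account read/write access to objects in the bucket.
gsutil iam ch serviceAccount:SA_EMAIL:roles/storage.objectAdmin gs://my-bucket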

Upvotes: 2
