Fred Rouvier

Reputation: 61

Use GCS staging directory for Spark jobs (on Dataproc)

I'm trying to change the Spark staging directory to prevent data loss on worker decommissioning (on Google Dataproc with Spark 2.4).
I want to switch from HDFS staging to Google Cloud Storage staging.

When I run this command:

spark-submit --conf "spark.yarn.stagingDir=gs://my-bucket/my-staging/"  gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py

I get this error:

org.apache.spark.SparkException: Application application_1560413919313_0056 failed 2 times due to AM Container for appattempt_1560413919313_0056_000002 exited with exitCode: -1000

Failing this attempt.Diagnostics: [2019-06-20 07:58:04.462]File not found : gs:/my-staging/.sparkStaging/application_1560413919313_0056/pyspark.zip java.io.FileNotFoundException: File not found : gs:/my-staging/.sparkStaging/application_1560413919313_0056/pyspark.zip

The Spark job fails but the .sparkStaging/ directory is created on GCS.

Any idea on this issue?

Thanks.

Upvotes: 3

Views: 3104

Answers (1)

Ben Sidhom

Reputation: 1588

First, it's important to realize that the staging directory is used mainly for staging artifacts for executors (chiefly jars and other archives), not for storing intermediate data as a job executes. If you want to preserve intermediate job data (primarily shuffle data) after worker decommissioning (e.g., after machine preemption or scale-down), then Dataproc Enhanced Flexibility Mode (currently in alpha) may help you.
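If you do want to try Enhanced Flexibility Mode, it is enabled through cluster properties at cluster creation time. A minimal sketch follows; the property name dataproc:efm.spark.shuffle=primary-worker is taken from the Dataproc EFM documentation as I recall it and may differ in the alpha release, and my-efm-cluster and the region are placeholders, so verify against the current docs:

# Create a cluster that keeps Spark shuffle data on primary workers (EFM).
# Property name is an assumption from the EFM docs; cluster name and region are placeholders.
gcloud dataproc clusters create my-efm-cluster \
    --region=us-central1 \
    --properties=dataproc:efm.spark.shuffle=primary-worker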

Your command works for me on both Dataproc image versions 1.3 and 1.4. Make sure that your target staging bucket exists and that the Dataproc cluster (i.e., the service account that the cluster runs as) has read and write access to the bucket. Note that the GCS connector will not create buckets for you.
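As a quick sanity check, you can verify the bucket and the cluster service account's access from the command line. A rough sketch with placeholder names (my-bucket, my-cluster, us-central1, and SA_EMAIL are assumptions to substitute with your own values):

# Confirm the staging bucket exists; create it if not (the GCS connector won't create it for you).
gsutil ls -b gs://my-bucket/ || gsutil mb -l us-central1 gs://my-bucket/

# Look up the service account the cluster runs as
# (empty output means the Compute Engine default service account is used).
gcloud dataproc clusters describe my-cluster --region=us-central1 \
    --format='value(config.gceClusterConfig.serviceAccount)'

# Grant that account read/write access to objects in the bucket.
gsutil iam ch serviceAccount:SA_EMAIL:roles/storage.objectAdmin gs://my-bucket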

Upvotes: 2
