Funzo

Reputation: 1290

How to run hudi on dataproc and write to gcs bucket

I want to write to a gcs bucket from dataproc using hudi.

To write to GCS using Hudi, the docs say to set the property fs.defaultFS to a gs:// value (https://hudi.apache.org/docs/gcs_hoodie).

However, when I set fs.defaultFS on Dataproc to a GCS bucket, I get errors at startup about the job not being able to find my jar. It is looking under a gs:/ prefix, presumably because I have overridden defaultFS, which it was previously using to find the jar. How would I fix this?

org.apache.spark.SparkException: Application application_1617963833977_0009 failed 2 times due to AM Container for appattempt_1617963833977_0009_000002 exited with  exitCode: -1000
Failing this attempt.Diagnostics: [2021-04-12 15:36:05.142]java.io.FileNotFoundException: File not found : gs:/user/root/.sparkStaging/application_1617963833977_0009/myjar.jar

If it is relevant, I am setting the defaultFS from within the code: sparkConfig.set("spark.hadoop.fs.defaultFS", "gs://defaultFs")

Upvotes: 2

Views: 738

Answers (1)

Dagang Wei

Reputation: 26458

You can try setting fs.defaultFS to GCS when creating the cluster. For example:

gcloud dataproc clusters create ...\
   --properties 'core:fs.defaultFS=gs://my-bucket'
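Once the cluster's default filesystem points at GCS, the Hudi write itself is an ordinary Spark DataFrame write to a gs:// base path. Below is a minimal sketch in Scala; the bucket name my-bucket, table name my_table, and the id/ts columns are placeholders, not anything from the original question:

import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiToGcsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-to-gcs")
      // Hudi requires the Kryo serializer.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    import spark.implicits._

    // Toy DataFrame standing in for real data; "id" and "ts" are assumed column names.
    val df = Seq((1, "a", 1000L), (2, "b", 2000L)).toDF("id", "value", "ts")

    df.write
      .format("hudi") // older Hudi releases use "org.apache.hudi" instead
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .mode(SaveMode.Overwrite)
      // Explicit gs:// path; "my-bucket" is a placeholder bucket.
      .save("gs://my-bucket/hudi/my_table")

    spark.stop()
  }
}

Setting fs.defaultFS at cluster creation (rather than inside the job) also means YARN's .sparkStaging paths resolve consistently, which avoids the gs:/ jar-lookup failure shown in the question.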

Upvotes: 2
