abhishek jha

Reputation: 1085

Getting a warning for each part file when a PySpark job on Google Dataproc writes directly to Google Storage

I am getting this warning for each part file that the Spark job creates when writing to Google Storage:

17/08/01 11:31:47 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://temp_bucket/output/part-09698
17/08/01 11:31:47 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://temp_bucket/output/part-09698 - removing from cache

The Spark job has 10 stages, and this warning appears after the 9th stage. Since the job creates ~11,500 part files, the warning is emitted once for each of them, which adds about 15 minutes to the job's runtime. I am running around 80 such jobs, so I am losing a lot of time and incurring significant extra cost.

Is there a way to suppress this warning?

Upvotes: 1

Views: 374

Answers (1)

Dennis Huo

Reputation: 10677

Recent changes have made it safe to disable the enforced list-consistency entirely; future releases are expected to phase it out gradually. Try the following in your job properties to disable the CacheSupplementedGoogleCloudStorage:

--properties spark.hadoop.fs.gs.metadata.cache.enable=false
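
For example, when submitting the job with gcloud (job.py and my-cluster are placeholder names here), the property would be passed along with the submit command, something like:

gcloud dataproc jobs submit pyspark job.py \
    --cluster my-cluster \
    --properties spark.hadoop.fs.gs.metadata.cache.enable=false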

Or if you're creating a new Dataproc cluster, in your cluster properties:

--properties core:fs.gs.metadata.cache.enable=false
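
For instance, a cluster-creation command carrying that property might look like the following (my-cluster is a placeholder name; the core: prefix puts the key into core-site.xml):

gcloud dataproc clusters create my-cluster \
    --properties core:fs.gs.metadata.cache.enable=false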

Upvotes: 1
