Reputation: 1085
I am getting this warning for each part file the Spark job creates when writing to Google Cloud Storage:
17/08/01 11:31:47 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://temp_bucket/output/part-09698
17/08/01 11:31:47 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://temp_bucket/output/part-09698 - removing from cache
The Spark job has 10 stages, and this warning comes after 9 stages. Since the job creates ~11,500 part files, the warning appears for each of them. Because of this my Spark job runs for an extra 15 minutes, and since I am running around 80 such jobs, I am losing a lot of time and incurring a lot of cost.
Is there a way to suppress this warning?
Upvotes: 1
Views: 374
Reputation: 10677
Recent GCS changes have made it safe to disable the enforced list-consistency entirely, and upcoming connector releases are expected to phase it out. To disable CacheSupplementedGoogleCloudStorage, try the following in your job properties:
--properties spark.hadoop.fs.gs.metadata.cache.enable=false
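For example, when submitting via the Dataproc jobs API, the flag could be passed like this (the cluster name, main class, and jar path below are placeholders, not from your setup):
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --properties spark.hadoop.fs.gs.metadata.cache.enable=false \
    --class com.example.MySparkJob \
    --jars gs://my-bucket/my-job.jar
The spark.hadoop. prefix tells Spark to forward the setting into the Hadoop configuration, where the GCS connector reads it.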
Or if you're creating a new Dataproc cluster, in your cluster properties:
--properties core:fs.gs.metadata.cache.enable=false
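A cluster-creation command would look roughly like this (cluster name is a placeholder); the core: prefix routes the property into core-site.xml so every job on the cluster picks it up:
gcloud dataproc clusters create my-cluster \
    --properties core:fs.gs.metadata.cache.enable=false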
Upvotes: 1