Reputation: 770
The scenario is this: Spark reads an S3 bucket in which some objects (Parquet files) have been transitioned to the Glacier storage class. I'm not trying to read those objects, but Spark throws an error on this kind of bucket (https://jira.apache.org/jira/browse/SPARK-21797). There is a workaround that "fixes" this issue: https://jira.apache.org/jira/browse/SPARK-21797?focusedCommentId=16140408&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16140408. But looking into the code (https://github.com/apache/spark/pull/16474/files), the read calls are still made, and files that raise an IOException are merely skipped afterwards.
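If I read that comment correctly, the workaround amounts to enabling Spark's ignoreCorruptFiles option, roughly like this (the setting itself exists in Spark; that it is exactly what the comment proposes is my interpretation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glacier-workaround").getOrCreate()

# Skip files that fail with an IOException instead of aborting the job.
# Note the Glacier objects are still opened before the failure is swallowed.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
```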
Is there a better way to configure Spark so that it loads only Standard-class objects from the S3 bucket?
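One client-side pre-filter I am considering is listing the keys myself and handing Spark an explicit file list. A minimal sketch, assuming boto3 is available, that the bucket and prefix names below are placeholders, and that the LIST response's per-object StorageClass field (as boto3's list_objects_v2 documents) is reliable:

```python
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skip-glacier-objects").getOrCreate()

s3 = boto3.client("s3")
bucket = "my-bucket"        # hypothetical bucket name
prefix = "tables/events/"   # hypothetical prefix

# Collect only Parquet objects still in the STANDARD storage class.
paths = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj.get("StorageClass", "STANDARD") == "STANDARD" and obj["Key"].endswith(".parquet"):
            paths.append(f"s3a://{bucket}/{obj['Key']}")

# Read only the explicitly listed files, so Glacier objects are never opened.
df = spark.read.parquet(*paths)
```

This trades one LIST pass over the prefix for never touching the Glacier objects at all, but it bypasses Spark's own partition discovery.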
Upvotes: 0
Views: 808
Reputation: 13430
I don't think S3's LIST call flags when an object has been transitioned to Glacier, so the filtering cannot be done during query planning/partitioning. Issuing a HEAD request for every object at that point in the process would be very expensive.
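To make the cost concrete, a per-object check would look roughly like this (a sketch using boto3; the bucket and key below are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def storage_class(bucket: str, key: str) -> str:
    # One HEAD request per object; HeadObject omits the StorageClass
    # header for STANDARD objects, hence the default.
    resp = s3.head_object(Bucket=bucket, Key=key)
    return resp.get("StorageClass", "STANDARD")

# Doing this for every object during planning adds a network round trip
# per file, which is what makes it prohibitively slow on large buckets.
print(storage_class("my-bucket", "tables/events/part-00000.parquet"))
```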
Upvotes: 0