Cesar A. Mostacero

Reputation: 770

Apache-spark - Reading data from aws-s3 bucket with glacier objects

The scenario is this:

There is a workaround that "fixes" this issue: https://jira.apache.org/jira/browse/SPARK-21797?focusedCommentId=16140408&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16140408. But looking at the code: https://github.com/apache/spark/pull/16474/files, the calls are still made, and files that raise an IOException are merely skipped. Is there a better way to configure Spark to load only Standard-class objects from an S3 bucket?
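One client-side alternative (not from the question; a sketch assuming the listing exposes a per-object `StorageClass` field, as boto3's `list_objects_v2` response does) is to filter the bucket listing before handing paths to Spark, so glaciated objects are never requested at all. The bucket name and helper function below are illustrative.

```python
# Sketch: filter an S3 listing by storage class before handing paths to Spark.
# Each entry mirrors the shape of boto3's list_objects_v2 "Contents" items
# (a dict with "Key" and "StorageClass"). The helper name is hypothetical.

def standard_object_paths(bucket, objects):
    """Return s3a:// paths for objects whose storage class allows direct reads."""
    readable = {"STANDARD", "STANDARD_IA", "INTELLIGENT_TIERING"}
    return [
        f"s3a://{bucket}/{obj['Key']}"
        for obj in objects
        if obj.get("StorageClass", "STANDARD") in readable
    ]

# Example listing, as list_objects_v2 would report it:
listing = [
    {"Key": "data/part-0000.parquet", "StorageClass": "STANDARD"},
    {"Key": "data/part-0001.parquet", "StorageClass": "GLACIER"},
    {"Key": "data/part-0002.parquet", "StorageClass": "STANDARD"},
]

paths = standard_object_paths("my-bucket", listing)
print(paths)
# Only the readable paths would then be passed to Spark, e.g.:
# df = spark.read.parquet(*paths)
```

This trades one extra LIST pass for never touching archived objects, at the cost of Spark no longer discovering the files itself.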

Upvotes: 0

Views: 808

Answers (1)

stevel

Reputation: 13430

  1. someone (you?) gets to fix https://issues.apache.org/jira/browse/HADOOP-14837; have s3a raise a specific exception when an attempt to read glaciated data fails
  2. then Spark needs to recognise that exception and skip the file when it happens

I don't think S3's LIST call flags when an object is glaciated, so the filtering cannot be done during query planning/partitioning. It would be very expensive to call HEAD for each object at that point in the process.
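The per-file skip that the linked PR implements can be sketched as follows: each read is attempted and failures are swallowed, which is why the requests against glaciated objects still happen. The file names and reader function here are illustrative, not Spark's internals.

```python
# Sketch of the skip-on-failure pattern behind spark.sql.files.ignoreCorruptFiles:
# attempt each file and drop the ones whose read raises, rather than
# filtering the listing up front.

def read_skipping_failures(paths, read_fn):
    """Read each path with read_fn; skip paths that raise OSError."""
    results = []
    for path in paths:
        try:
            results.append(read_fn(path))
        except OSError as exc:  # analogous to Java's IOException
            print(f"skipping {path}: {exc}")
    return results

# Simulated reader: a "glaciated" object fails on access.
def fake_reader(path):
    if "glacier" in path:
        raise OSError("InvalidObjectState: object is archived")
    return f"rows from {path}"

rows = read_skipping_failures(
    ["s3a://b/ok-1.parquet", "s3a://b/glacier-2.parquet", "s3a://b/ok-3.parquet"],
    fake_reader,
)
print(rows)
```

Note the cost model: every path is still opened once, so the skip saves the job but not the request, matching the behaviour the question complains about.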

Upvotes: 0
