Cesar A. Mostacero

Reputation: 770

Apache-spark - Reading data from aws-s3 bucket with glacier objects

The scenario is this:

There is a workaround that "fixes" this issue: https://jira.apache.org/jira/browse/SPARK-21797?focusedCommentId=16140408&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16140408. But looking at the code: https://github.com/apache/spark/pull/16474/files, the calls are still made, and files that raise an IOException are merely skipped. Is there a better way to configure Spark to load only Standard-class objects from an S3 bucket?
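One client-side alternative (not from the question; a sketch assuming the listing exposes a per-object `StorageClass` field, as boto3's `list_objects_v2` response does) is to filter the bucket listing before handing paths to Spark, so glaciated objects are never requested at all. The bucket name and helper function below are illustrative.

```python
# Sketch: filter an S3 listing by storage class before handing paths to Spark.
# Each entry mirrors the shape of boto3's list_objects_v2 "Contents" items
# (a dict with "Key" and "StorageClass"). The helper name is hypothetical.

def standard_object_paths(bucket, objects):
    """Return s3a:// paths for objects whose storage class allows direct reads."""
    readable = {"STANDARD", "STANDARD_IA", "INTELLIGENT_TIERING"}
    return [
        f"s3a://{bucket}/{obj['Key']}"
        for obj in objects
        if obj.get("StorageClass", "STANDARD") in readable
    ]

# Example listing, as list_objects_v2 would report it:
listing = [
    {"Key": "data/part-0000.parquet", "StorageClass": "STANDARD"},
    {"Key": "data/part-0001.parquet", "StorageClass": "GLACIER"},
    {"Key": "data/part-0002.parquet", "StorageClass": "STANDARD"},
]

paths = standard_object_paths("my-bucket", listing)
print(paths)
# Only the readable paths would then be passed to Spark, e.g.:
# df = spark.read.parquet(*paths)
```

This trades one extra LIST pass for never touching archived objects, at the cost of Spark no longer discovering the files itself.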

Upvotes: 0

Views: 808

Answers (1)

stevel

Reputation: 13430

  1. someone (you?) gets to fix https://issues.apache.org/jira/browse/HADOOP-14837; have s3a raise a specific exception when an attempt to read glaciated data fails
  2. then Spark needs to recognise that exception and skip the file when it happens

I don't think S3's LIST call flags when an object is glaciated, so the filtering cannot be done during query planning/partitioning. It would be very expensive to call HEAD for each object at that point in the process.
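The per-file skip that the linked PR implements can be sketched as follows: each read is attempted and failures are swallowed, which is why the requests against glaciated objects still happen. The file names and reader function here are illustrative, not Spark's internals.

```python
# Sketch of the skip-on-failure pattern behind spark.sql.files.ignoreCorruptFiles:
# attempt each file and drop the ones whose read raises, rather than
# filtering the listing up front.

def read_skipping_failures(paths, read_fn):
    """Read each path with read_fn; skip paths that raise OSError."""
    results = []
    for path in paths:
        try:
            results.append(read_fn(path))
        except OSError as exc:  # analogous to Java's IOException
            print(f"skipping {path}: {exc}")
    return results

# Simulated reader: a "glaciated" object fails on access.
def fake_reader(path):
    if "glacier" in path:
        raise OSError("InvalidObjectState: object is archived")
    return f"rows from {path}"

rows = read_skipping_failures(
    ["s3a://b/ok-1.parquet", "s3a://b/glacier-2.parquet", "s3a://b/ok-3.parquet"],
    fake_reader,
)
print(rows)
```

Note the cost model: every path is still opened once, so the skip saves the job but not the request, matching the behaviour the question complains about.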

Upvotes: 0
