Weston Sankey

Reputation: 343

Spark S3 Eventual Consistency Issues

I have several Spark jobs that write data to and read data from S3. Occasionally (about once per week for approximately 3 hours), the Spark jobs will fail with the following exception:

org.apache.spark.sql.AnalysisException: Path does not exist.

From what I've found, this is likely due to S3's consistency model, in which list operations are eventually consistent. S3Guard claims to solve this issue, but my Spark environment doesn't support that utility.

Has anyone else run into this issue and figured out a reasonable approach for dealing with it?

Upvotes: 1

Views: 1893

Answers (1)

stevel

Reputation: 13430

  • If you are using AWS EMR: it offers EMRFS consistent view.
  • If you are using Databricks: they offer a consistency mechanism in their transactional IO.
  • If you are using HDP or CDH: both ship with S3Guard.
  • If you are running your own home-rolled Spark stack: move to Hadoop 2.9+ to get S3Guard or, even better, to Hadoop 3.1 for the zero-rename S3A committer.
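For the home-rolled case, the S3Guard and committer settings are ordinary Hadoop S3A options, which Spark accepts when prefixed with `spark.hadoop.`. The following is only a rough sketch of how those options might be turned into `spark-submit` arguments. The option names are the ones I know from the Hadoop S3A documentation, so check them against the Hadoop version you deploy; the table name and the helper function are made up for illustration, and the committer also needs Spark-side bindings that are not shown here.

```python
# Hadoop S3A options for S3Guard and the S3A committer.
# Verify the names against the docs for your Hadoop version.
S3A_OPTIONS = {
    # S3Guard (Hadoop 2.9+): keep a consistent listing in DynamoDB.
    "fs.s3a.metadatastore.impl":
        "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore",
    "fs.s3a.s3guard.ddb.table": "my-s3guard-table",  # hypothetical table name
    "fs.s3a.s3guard.ddb.table.create": "true",
    # Zero-rename committer (Hadoop 3.1+ only).
    "fs.s3a.committer.name": "directory",
}


def spark_submit_conf_args(hadoop_options):
    """Hypothetical helper: render Hadoop options as spark-submit --conf args."""
    args = []
    for key, value in sorted(hadoop_options.items()):
        # Spark forwards any spark.hadoop.* property to the Hadoop configuration.
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(spark_submit_conf_args(S3A_OPTIONS)))
```

The printed arguments can be pasted into a `spark-submit` command line; the same keys could equally go into `spark-defaults.conf`.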

Otherwise, if none of those options is available to you: don't use S3 as the direct destination of your work.
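One way to read that advice is to have the job write to a consistent filesystem such as HDFS first and copy the finished output to S3 as a separate step. Below is a minimal sketch of that copy step using `hadoop distcp`; the function names and the paths in the usage note are hypothetical, and it assumes the `hadoop` CLI is on the PATH of the machine running it.

```python
import subprocess


def distcp_command(staging_uri, final_uri):
    """Build the `hadoop distcp` invocation that copies staged output to S3."""
    return ["hadoop", "distcp", staging_uri, final_uri]


def publish(staging_uri, final_uri, run=subprocess.run):
    """Copy finished output from the staging filesystem to its S3 destination.

    `run` is injectable so the function can be exercised without a cluster.
    """
    # check=True makes subprocess.run raise CalledProcessError if the copy fails.
    run(distcp_command(staging_uri, final_uri), check=True)
```

In use, the Spark job would write to something like `hdfs:///staging/job-output`, and only after it succeeds would you call `publish("hdfs:///staging/job-output", "s3a://my-bucket/job-output")`.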

Upvotes: 1
