Kapil

Reputation: 9959

How to make data highly available when a Spark job writes to S3 and Athena is used for queries

I'm storing a stream of data arriving via Firehose in S3 and have created tables in Athena to query this data. The data in S3 is partitioned on fields like clientID and date. A Spark job processes this incoming data, which arrives at regular intervals. On each run, the Spark job takes the delta for that interval, merges it with the base data already in that partition in S3 (deduplicating by last-modified time), and overwrites the partition. When the S3A committer writes these files, it deletes the existing files and then copies in the newly created ones.

Is it possible that a query against the Athena tables returns no data because the old files have been deleted but the new files are not completely written yet? If so, how do you handle that?
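For reference, here is a minimal PySpark sketch of the merge-and-overwrite pattern I'm describing. The paths and the `record_id` / `last_modified` column names are illustrative, not my actual schema:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("merge-delta").getOrCreate()

part = "clientID=c1/date=2020-02-07"  # example partition
base = spark.read.parquet(f"s3a://my-data/table/{part}")
delta = spark.read.parquet(f"s3a://my-data/incoming/{part}")

# Keep the newest record per key, ranking duplicates by last-modified time.
w = Window.partitionBy("record_id").orderBy(F.col("last_modified").desc())
merged = (base.unionByName(delta)
              .withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .drop("rn"))

# Materialize the result so Spark will allow overwriting a path it read from.
merged = merged.localCheckpoint()

# Overwriting in place is what opens the visibility gap: the committer deletes
# the old files before the new ones are fully in place.
merged.write.mode("overwrite").parquet(f"s3a://my-data/table/{part}")
```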

Upvotes: 1

Views: 335

Answers (1)

Dave

Reputation: 2049

Yes, if the underlying S3 objects are deleted, the Athena query will return zero rows. The S3A committer deletes objects prior to uploading the new ones, so there is always some window of time during which the table's backing data is missing or incomplete.

To keep Athena queries highly available over data that is being updated, write each batch of query data to a versioned path in S3 (like s3://my-data/2020-02-07) at the appropriate frequency. When a batch has completed, send ALTER TABLE SET LOCATION DDL to the Athena database, pointing the table at the newest versioned path. Then clean up old paths (newest version - n) in line with your retention policy, as in the sketch below.
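A minimal sketch of the location switch using boto3's Athena client. The bucket, database, and table names are placeholders:

```python
import boto3

athena = boto3.client("athena")
new_location = "s3://my-data/2020-02-07/"  # path the batch job just finished writing

# Repoint the table at the completed batch. The location change is a
# metadata-only operation, so readers never see a half-written path.
athena.start_query_execution(
    QueryString=f"ALTER TABLE my_table SET LOCATION '{new_location}'",
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Old versioned paths (newest - n) can then be deleted on the retention schedule.
```

Because queries running during the switch still read from the previous path, the data stays fully available; the cost is temporarily storing n extra copies of the data.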

Upvotes: 1
