Raghvendra Yadav

Reputation: 1

How to ensure Atomicity and Data Integrity in Spark Queries During Parquet File Overwrites for Compression Optimization?

I have a Spark setup in which partitions of original Parquet files exist and queries are actively running against those partitions. A background job optimizes these Parquet files for better compression, which involves changing the Parquet object layout. How can I ensure that the Parquet file overwrites are atomic and do not fail or cause data-integrity issues in the running Spark queries? What are the possible solutions?

We cannot use a data lakehouse because of legacy challenges.

Upvotes: 0

Views: 97

Answers (1)

Islam Elbanna

Reputation: 1757

This is an open-ended question without more details about the use case, but here are a few thoughts:

  • Always partition your data by date/hour so you can safely optimize each partition separately and only touch old data that is no longer being modified.
  • Have the optimization job write into a new location, and only after the write to the new location has finished:
    • If the schema has not changed, replace the current data with the optimized data (see the first sketch below).
    • If the schema has changed, keep both copies so each dataset stays consistent and can be reprocessed if needed, and point queries at the new location (see the second sketch below).
  • Run validation checks before swapping in the optimized data, such as comparing the expected number of records, business metrics, or checksums, and fail the job if any check does not pass.
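A minimal PySpark sketch of the write-validate-swap idea, assuming an HDFS-backed warehouse; the paths, partition column, target file count, and zstd compression are illustrative assumptions, not something from your setup:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compaction").getOrCreate()

# Hypothetical paths: the live partition being queried and a staging location.
src = "hdfs:///warehouse/events/dt=2024-01-15"
staging = "hdfs:///warehouse/_staging/events/dt=2024-01-15"

# 1. Rewrite the partition with the desired layout/compression into staging.
df = spark.read.parquet(src)
(df.repartition(8)                       # fewer, larger files (assumed target)
   .write.mode("overwrite")
   .option("compression", "zstd")        # assumes a Spark version with zstd support
   .parquet(staging))

# 2. Validate before touching the live data: row counts must match.
orig_count = df.count()
new_count = spark.read.parquet(staging).count()
if orig_count != new_count:
    raise RuntimeError(f"Validation failed: {orig_count} != {new_count}")

# 3. Swap directories. On HDFS a directory rename is atomic; on S3 it is a
#    copy+delete, so prefer the metastore approach shown in the next sketch.
#    Queries that already listed the old files can still fail when those files
#    disappear, so run this in a quiet window if possible.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
Path = spark.sparkContext._jvm.org.apache.hadoop.fs.Path
fs = Path(src).getFileSystem(hadoop_conf)
backup = src + ".bak"
fs.rename(Path(src), Path(backup))       # keep old files until the swap succeeds
fs.rename(Path(staging), Path(src))
fs.delete(Path(backup), True)            # clean up once the new data is in place
```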
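If the data is registered as a table in a Hive metastore, a sketch of the "move queries to the new location" option is to leave the old files in place and repoint the partition at the optimized copy; the table name, partition spec, and path below are hypothetical:

```python
# Assumes a Hive-metastore-backed table named "events" partitioned by dt.
spark.sql("""
    ALTER TABLE events PARTITION (dt = '2024-01-15')
    SET LOCATION 'hdfs:///warehouse/events_optimized/dt=2024-01-15'
""")

# Queries planned after this metadata update read the optimized files, while
# queries already running keep reading the old files, which can be cleaned up
# later once nothing references them.
```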

Upvotes: 0
