Sarthak Agrawal
Sarthak Agrawal

Reputation: 586

Reading version specific files delta-lake

I want to read the delta data after a certain timestamp/version. The logic here suggests to read the entire data and read the specific version, and then find the delta. As my data is huge, I would prefer not to read the entire data and if somehow be able to read only the data after certain timestamp/version.

Any suggestions?

Upvotes: 1

Views: 1997

Answers (1)

Alex Ott
Alex Ott

Reputation: 87164

If you need data that have timestamp after some specific date, then you still need to shift through all data. But Spark & Delta Lake may help here if you organize your data correctly:

  • You can have time-based partitions, for example, store data by day/week/month, so when Spark will read data it may read only specific partitions (perform so-called predicate pushdown), for example, df = spark.read.format("delta").load(...).filter("day > '2021-12-29'") - this will work not only for Delta, but for other formats as well. Delta Lake may additionally help here because is supports so-called generated columns where you don't need to create a partition column explicitly, but allow Spark to generate it for you based on other columns

  • On top of partitioning, formats like Parquet (and Delta that is based on Parquet) allow to skip reading all data because they maintain the min/max statistics inside the files. But you will still need to read these files

  • On Databricks, Delta Lake has more capabilities for selective read of the data - for example, that min/max statistics that Parquet has inside the file, could be saved into the transaction log, so Delta won't need to open file to check if timestamp in the given range - this technique is called data skipping. Additional performance could come from the ZOrdering of the data that will collocate data closer to each other - that's especially useful when you need to filter by multiple columns

Update 14.04.2022: Data Skipping is also available in OSS Delta, starting with version 1.2.0

Upvotes: 2

Related Questions