Reputation: 586
I want to read the delta data after a certain timestamp/version. The logic here suggests reading the entire data at a specific version and then finding the delta. As my data is huge, I would prefer not to read the entire dataset and instead somehow read only the data after a certain timestamp/version.
Any suggestions?
Upvotes: 1
Views: 1997
Reputation: 87164
If you need the data whose timestamp is after some specific date, then you still need to sift through all the data. But Spark & Delta Lake may help here if you organize your data correctly:
You can have time-based partitions, for example, store data by day/week/month, so when Spark reads the data it can read only the relevant partitions (so-called predicate pushdown, here acting as partition pruning), for example, df = spark.read.format("delta").load(...).filter("day > '2021-12-29'") - this works not only for Delta but for other formats as well. Delta Lake may additionally help here because it supports so-called generated columns, where you don't need to populate a partition column explicitly but let Spark generate it for you based on other columns (see the sketch below).
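A minimal sketch of this setup; the table name `events` and the columns `eventTime`/`eventDate` are illustrative assumptions, and `spark` is assumed to be a Delta-enabled SparkSession:

```python
from pyspark.sql.types import TimestampType, DateType
from delta.tables import DeltaTable

# Hypothetical table and column names: eventDate is generated from
# eventTime and used as the partition column, so the writer never
# populates it by hand, yet date filters can prune whole partitions.
(DeltaTable.createIfNotExists(spark)
    .tableName("events")
    .addColumn("id", "BIGINT")
    .addColumn("eventTime", TimestampType())
    .addColumn("eventDate", DateType(), generatedAlwaysAs="CAST(eventTime AS DATE)")
    .partitionedBy("eventDate")
    .execute())

# Only partitions with eventDate > 2021-12-29 are read:
df = spark.read.table("events").filter("eventDate > '2021-12-29'")
```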
On top of partitioning, formats like Parquet (and Delta, which is based on Parquet) allow skipping parts of the data because they maintain min/max statistics inside the files. But you will still need to open each file to read those statistics from its footer (see the sketch below).
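As a sketch (with the same assumed `eventTime` column), a plain filter is enough for the reader to compare the predicate against those statistics and skip chunks whose min/max range cannot match:

```python
# No partition column involved here: the reader checks the predicate
# against per-file / per-row-group min/max statistics from the Parquet
# footers and skips chunks that cannot contain matching rows.
df = (spark.read.format("delta")
          .load("/path/to/events")
          .filter("eventTime >= '2021-12-29 00:00:00'"))
```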
On Databricks, Delta Lake has more capabilities for selective reads of the data - for example, the min/max statistics that Parquet keeps inside each file can be saved into the transaction log, so Delta won't need to open a file to check whether its timestamps fall in the given range - this technique is called data skipping. Additional performance can come from Z-Ordering of the data, which collocates related data closer together - that's especially useful when you need to filter by multiple columns (see the sketch below).
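A minimal sketch of Z-Ordering via the `OPTIMIZE ... ZORDER BY` SQL command available on Databricks; the table name and columns are again assumptions:

```python
# Rewrites the files of the (hypothetical) events table so that rows
# with similar eventTime/userId values land in the same files; combined
# with data skipping, this eliminates more files for multi-column filters.
spark.sql("OPTIMIZE events ZORDER BY (eventTime, userId)")
```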
Update 14.04.2022: Data Skipping is also available in OSS Delta, starting with version 1.2.0
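In OSS Delta, file statistics are collected only for the first 32 columns by default, controlled by the `delta.dataSkippingNumIndexedCols` table property; a hedged sketch (table name is an assumption):

```python
# If the timestamp column sits beyond the first 32 columns, widen the
# setting (or move the column earlier) so it participates in data skipping.
spark.sql(
    "ALTER TABLE events "
    "SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40')"
)
```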
Upvotes: 2