nl09

Reputation: 103

Delta Lake: how to avoid carrying deleted records into the next version of a Delta table?

We load source files into a Delta Lake table (table name: dst1) every day as time-series data. If the Delta Lake table does not exist, we create it. If the table does exist, we merge the new data in as time-series data.

Creating Table if table does not exist:
spark.sql(f"CREATE TABLE {dtable} USING DELTA LOCATION {dmount1}")       
 
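For reference, a minimal sketch of how the existence check could be wired around that statement. This is an assumption about the surrounding flow, not the asker's exact code; spark.catalog.tableExists is available from Spark 3.3 on, and dtable/dmount1 are the variables from the question:

# Hedged sketch of the create branch; assumes Spark 3.3+ for
# spark.catalog.tableExists, and that `dtable`/`dmount1` hold the
# table name and the mounted storage path, as in the question.
if not spark.catalog.tableExists(dtable):
    # Register a Delta table over the files at the mount path;
    # note the quotes around the path literal.
    spark.sql(f"CREATE TABLE {dtable} USING DELTA LOCATION '{dmount1}'")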

If the table does exist, we merge the new data in as time-series data, where:

df     --> holds the data for day 1
dtable --> holds the data for day 0
jkey   --> d.id = ds.id

spark.sql(f"MERGE INTO {dtable} d USING df ds ON {jkey} WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *")

Data from source on day 1:

id  lname   fname   gender  satscore
1   Steve   Jobs    M   1100
2   Mapple  Syrup   F   1300
4   Tom     Cruise  M   1200
6   New     Record  F   1320

Data from source on day 2:

id  lname   fname   gender  satscore
1   Steve   Not     M   2000

select * from dst1 VERSION AS OF 0
id  lname   fname   gender  satscore
1   Steve   Jobs    M   1100
2   Mapple  Syrup   F   1300
4   Tom     Cruise  M   1200
6   New     Record  F   1320

select * from dst1 VERSION AS OF 1
id  lname   fname   gender  satscore
1   Steve   Not     M   2000
2   Mapple  Syrup   F   1300
4   Tom     Cruise  M   1200
6   New     Record  F   1320

Day 2's source data does not contain ids 2, 4, and 6, but I still see them in version 1 of the Delta table. I was expecting version 1 to contain only id 1. How can I achieve this?

https://docs.databricks.com/delta/delta-update.html#language-sql

Upvotes: 2

Views: 1078

Answers (1)

Alex Ott

Reputation: 87069

That's not how Delta Lake works: each version contains all of the data that should be in the table at that point in time. A version is not a delta that gets combined with the other versions to present a unified view of all the data. If you want the new version to contain only the entries from day 2, you need to overwrite the data instead of merging, but then you will have to read each version separately to build a view of the whole table.
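A minimal sketch of that overwrite approach, assuming the table lives at the dmount1 path from the question (versionAsOf is the standard Delta time-travel read option):

# Overwrite instead of merge: each new version then holds only that
# day's records, and earlier days stay reachable via time travel.
df.write.format("delta").mode("overwrite").save(dmount1)

# Read any earlier day back by version number:
day0 = spark.read.format("delta").option("versionAsOf", 0).load(dmount1)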

Upvotes: 1
