Reputation: 103
We load source files into a Delta Lake table (table name: dst1) every day as time-series data. If the Delta Lake table does not exist, we create it. If the table does exist, we merge the new data in as time-series data.
Creating Table if table does not exist:
spark.sql(f"CREATE TABLE {dtable} USING DELTA LOCATION '{dmount1}'")
If the table does exist, we merge the new data as time-series data:
here df --> has the data for day 1
dtable --> has the data for day 0
jkey --> "d.id = ds.id"
df.createOrReplaceTempView("ds")  # expose the DataFrame to SQL by name
spark.sql(f"MERGE INTO {dtable} d USING ds ON {jkey} WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *")
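As a plain-Python sketch (no Spark required), the effect of `WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *` on the table contents can be modeled with dicts keyed by id; the data values are taken from the question, the `merge` helper is illustrative:

```python
def merge(table, batch):
    """Simulate Delta MERGE: matched rows are updated, unmatched
    batch rows are inserted, and existing rows with no match in
    the batch are kept unchanged."""
    merged = dict(table)   # existing rows survive
    merged.update(batch)   # matched -> updated, new ids -> inserted
    return merged

# day-0 table and day-1 batch, keyed by id
day0 = {1: ("Steve", "Jobs", "M", 1100), 2: ("Mapple", "Syrup", "F", 1300)}
day1 = {1: ("Steve", "Not", "M", 2000), 6: ("New", "Record", "F", 1320)}

result = merge(day0, day1)
# id 2 is still present even though the day-1 batch did not contain it
```

This is exactly why ids that are absent from the incoming batch still appear in the new table version: MERGE never deletes rows it did not match.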
Data from source on day1:
Data from source on day2:
select * from dst1 VERSION AS OF 0
id  lname   fname   gender  satscore
1   Steve   Jobs    M       1100
2   Mapple  Syrup   F       1300
4   Tom     Cruise  M       1200
6   New     Record  F       1320
select * from dst1 VERSION AS OF 1
id  lname   fname   gender  satscore
1   Steve   Not     M       2000
2   Mapple  Syrup   F       1300
4   Tom     Cruise  M       1200
6   New     Record  F       1320
The day-2 source does not contain ids 2, 4, and 6, yet I still see them in version 1 of the Delta table. I expect version 1 to contain only id 1. How can I achieve this?
https://docs.databricks.com/delta/delta-update.html#language-sql
Upvotes: 2
Views: 1078
Reputation: 87069
That's not how Delta works: each version contains all the data that belongs in the table at that point in time, not just the increment on top of earlier versions. If you want the new version to contain only the day-2 rows, you need to overwrite the data instead of merging it, but then you will have to read each version separately to build a view of the whole table.
Upvotes: 1