Reputation: 11
We have an ELT process, run through Databricks, that stores data partitioned by Year in a Delta Lake. In Databricks the queried location shows the data correctly, with no duplicates and no variation in the total count. When I create a view over the same partitioned data using Synapse Serverless, the data is displayed with duplicates after an update happens; when the data is created for the first time there are no issues whatsoever. I have troubleshot this and found that it only happens when using views over partitioned data after an update. If I use an external table with no partition specified, the results are correct as well.
Delta Lake partitioned data overview
On Databricks the data is read correctly:
select PKCOLUMNS, count(*) from mytable group by PKCOLUMNS having count(*)>1
-- no duplicates
select count(*) from mytable --407,421
On Synapse Serverless:
CREATE VIEW MY_TABLE_VIEW AS
SELECT *,
results.filepath(1) as [Year]
FROM
OPENROWSET(
BULK 'mytable/Year=*/*.parquet',
DATA_SOURCE = 'DeltaLakeStorage',
FORMAT = 'PARQUET'
)
WITH(
[param1] nvarchar(4000),
[param2] float,
[PKCOLUMNS] nvarchar(4000)
) AS [results]
GO
select PKCOLUMNS, count(*) from MY_TABLE_VIEW
group by PKCOLUMNS
having count(*)>1 --duplicates
GO
select count(*) from MY_TABLE_VIEW --814,842
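One way to narrow this down is to check whether the duplicated keys come from more than one Parquet file, which would mean the extra rows are being read from files left behind by the update. A minimal diagnostic sketch, reusing the same OPENROWSET pattern as the view (filepath() with no arguments returns the full path of the file each row was read from):

-- Diagnostic only: for each duplicated key, count its rows and distinct source files
SELECT PKCOLUMNS,
       COUNT(*) AS row_count,
       COUNT(DISTINCT results.filepath()) AS file_count
FROM OPENROWSET(
    BULK 'mytable/Year=*/*.parquet',
    DATA_SOURCE = 'DeltaLakeStorage',
    FORMAT = 'PARQUET'
)
WITH (
    [PKCOLUMNS] nvarchar(4000)
) AS [results]
GROUP BY PKCOLUMNS
HAVING COUNT(*) > 1;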
Upvotes: 1
Views: 741
Reputation: 1298
The issue is related to the view created in Synapse Serverless.
Instead of using OPENROWSET to access the Delta Lake files directly, you can try creating an external table (for example, EXT.EDW_Table1) in Synapse that points to the Delta Lake files.
That way, the Delta Lake metadata handles the partitioning of the data and ensures that the partitions are correctly updated when the data changes.
Step 1:
Create the external table:
CREATE EXTERNAL TABLE external_Table1 (
    param1 string,
    param2 float,
    PKCOLUMNS string
)
PARTITIONED BY (Year string)
STORED AS PARQUET
LOCATION 'external_Table1';
In step 1, the Delta Lake (Parquet) files are mapped to "external_Table1" and partitioned by the Year column.
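Note that the DDL above is Spark/Hive-style syntax; if the table is to be created in the Synapse serverless SQL pool itself, the T-SQL external table syntax applies instead. A minimal sketch under that assumption, reusing the DeltaLakeStorage data source from the question and a hypothetical DeltaFileFormat file format (the serverless Delta reader uses the transaction log to decide which files to read, and T-SQL external tables have no PARTITIONED BY clause):

-- Hypothetical file format name; the data source is the one from the question
CREATE EXTERNAL FILE FORMAT DeltaFileFormat
WITH (FORMAT_TYPE = DELTA);
GO

CREATE EXTERNAL TABLE external_Table1 (
    param1 nvarchar(4000),
    param2 float,
    PKCOLUMNS nvarchar(4000)
)
WITH (
    LOCATION = 'mytable',            -- root folder of the Delta table, not the Year=* subfolders
    DATA_SOURCE = DeltaLakeStorage,
    FILE_FORMAT = DeltaFileFormat
);
GO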
Step 2:
Run a SELECT against the external table to check for duplicates:
SELECT PKCOLUMNS, COUNT(*) FROM external_Table1
GROUP BY PKCOLUMNS HAVING COUNT(*) > 1;
If you still see duplicate values after the update, try running the VACUUM operation on the Delta Lake files.
Step 3:
%sql
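-- VACUUM permanently removes data files that are no longer referenced by the
-- Delta transaction log and are older than the retention threshold (7 days by default).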
VACUUM external_Table1;
Please note that the VACUUM operation can take some time to complete depending on the size of your data.
Upvotes: 0