Reputation: 11
We have an ELT process, run through Databricks, that stores data partitioned by Year in a Delta Lake. In Databricks the queried location shows the data correctly, with no duplicates and no variation in the total count. When I create a view over the same partitioned data using Synapse Serverless, the data is displayed with duplicates after an update happens; when the data is created for the first time there are no issues whatsoever. I have troubleshot this and found that it only happens when using views over partitioned data after an update. If I use an external table with no partition specified, the results are correct as well.
Delta Lake partitioned data overview
On Databricks the data is read correctly:
select PKCOLUMNS, count(*) from mytable group by PKCOLUMNS having count(*)>1
-- no duplicates
select count(*) from mytable --407,421
On Synapse Serverless:
CREATE VIEW MY_TABLE_VIEW AS
SELECT *,
results.filepath(1) as [Year]
FROM
OPENROWSET(
BULK 'mytable/Year=*/*.parquet',
DATA_SOURCE = 'DeltaLakeStorage',
FORMAT = 'PARQUET'
)
WITH(
[param1] nvarchar(4000),
[param2] float,
[PKCOLUMNS] nvarchar(4000)
) AS [results]
GO
select PKCOLUMNS, count(*) from MY_TABLE_VIEW
group by PKCOLUMNS
having count(*)>1 --duplicates
GO
select count(*) from MY_TABLE_VIEW --814,842
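One way to narrow this down is to check whether the duplicated keys come from more than one Parquet file, which would mean the extra rows are being read from files left behind by the update. A minimal diagnostic sketch, reusing the same OPENROWSET pattern as the view (filepath() with no arguments returns the full path of the file each row was read from):

-- Diagnostic only: for each duplicated key, count its rows and distinct source files
SELECT PKCOLUMNS,
       COUNT(*) AS row_count,
       COUNT(DISTINCT results.filepath()) AS file_count
FROM OPENROWSET(
    BULK 'mytable/Year=*/*.parquet',
    DATA_SOURCE = 'DeltaLakeStorage',
    FORMAT = 'PARQUET'
)
WITH (
    [PKCOLUMNS] nvarchar(4000)
) AS [results]
GROUP BY PKCOLUMNS
HAVING COUNT(*) > 1;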
Upvotes: 1
Views: 741
Reputation: 1298
The issue is related to the view created in Synapse Serverless.
Instead of using OPENROWSET to access the Delta Lake files directly, you can try creating an external table (for example, EXT.EDW_Table1) in Synapse that points to the Delta Lake files.
That way, the Delta Lake metadata handles the partitioning of the data and ensures that the partitions are correctly updated when the data changes.
Step 1:
Create the external table:
CREATE EXTERNAL TABLE external_Table1 (
    param1 string,
    param2 float,
    PKCOLUMNS string
)
PARTITIONED BY (Year string)
STORED AS PARQUET
LOCATION 'external_Table1';
In step 1, the Delta Lake (Parquet) files are mapped to "external_Table1" and partitioned by the Year column.
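Note that the DDL above is Spark/Hive-style syntax; if the table is to be created in the Synapse serverless SQL pool itself, the T-SQL external table syntax applies instead. A minimal sketch under that assumption, reusing the DeltaLakeStorage data source from the question and a hypothetical DeltaFileFormat file format (the serverless Delta reader uses the transaction log to decide which files to read, and T-SQL external tables have no PARTITIONED BY clause):

-- Hypothetical file format name; the data source is the one from the question
CREATE EXTERNAL FILE FORMAT DeltaFileFormat
WITH (FORMAT_TYPE = DELTA);
GO

CREATE EXTERNAL TABLE external_Table1 (
    param1 nvarchar(4000),
    param2 float,
    PKCOLUMNS nvarchar(4000)
)
WITH (
    LOCATION = 'mytable',            -- root folder of the Delta table, not the Year=* subfolders
    DATA_SOURCE = DeltaLakeStorage,
    FILE_FORMAT = DeltaFileFormat
);
GO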
Step 2:
Run a SELECT against the external table to check for duplicates:
SELECT PKCOLUMNS, COUNT(*) FROM external_Table1
GROUP BY PKCOLUMNS HAVING COUNT(*) > 1;
If you still see duplicate values after the update, try running the VACUUM operation on the Delta Lake files.
Step 3:
%sql
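-- VACUUM permanently removes data files that are no longer referenced by the
-- Delta transaction log and are older than the retention threshold (7 days by default).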
VACUUM external_Table1;
Please note that the VACUUM operation can take some time to complete depending on the size of your data.
Upvotes: 0