Reputation: 13
I am using a copy data activity in Azure Data Factory. The source dataset is a Table on a Azure SQL server and the sink is aswell. The source data looks like this:
The sink data table has the same structure as the source, but I only want 1 record per taxonomie_id and dim_date. The value screenPageViews should be updated.
This is the sink data I get after the first run of the Copy Data Activity, with Upsert activated and taxonomie_id and dim_date columns as keys:
I thought I would end up with just one record per key combination.
Something like this:
79540;20230316;1
79560;20230316;2
79530;20230316;3
It even gets worse if I run the pipeline a second time. I then get this:
I now have twice as many records as before and at this point, the data is useless and wrong.
This the source tab in the Copy Data Activity:
This is the query:
SELECT
taxonomie_id,
dim_date,
screenPageViews
FROM stg0_ga4.dateScreenPageViews_Page
ORDER BY 2 DESC,1 DESC;
This is the sink configuration:
What am I doing wrong? Do I missunderstand the concept of the upsert functionality?
Upvotes: 0
Views: 1799
Reputation: 8382
What am I doing wrong? Do I missunderstand the concept of the upsert functionality?
Working of UPSERT
function in copy activity.
upsert
command adds data and changes the values of other rows.Merge
function.key_column
fits, it will change the values for the other columns.Upsert
function check all the rows of source to all the rows of sink and the based upon matching on key columns it will update or insert record in sink.I only want 1 record per taxonomie_id and dim_date.
Agreed with Chen Hirsh as per your assumed Output and you want only 1 record per taxonomie_id
and dim_dateyou
so you have to group them and for screenPageViews
you have to either group by function or aggregate function to select max
(aggregate function) screenPageViews by grouping other two columns.
With Group By
function:
SELECT taxonomie_id, dim_date, screenPageViews
FROM [dbo].[sql1demo]
GROUP BY taxonomie_id,dim_date,screenPageViews
Order by 2 desc, 1 desc
This will give you only 1 record with per taxonomie_id and dim_date
Without group by function or aggregate function for screenPageViews it will throw you an error as:
I dont want the max value, I want the data to be updated in that order.
upsert
function will insert all the data that is not matching with key columns.Upvotes: 0
Reputation: 1400
The upsert moves all rows from source to sink, and since you have multiple rows per key in your source, it would move all the rows to sink. Try an aggregation function on your source query:
SELECT
taxonomie_id,
dim_date,
max(screenPageViews) as screenPageViews
FROM stg0_ga4.dateScreenPageViews_Page
GROUP BY taxonomie_id,dim_date
To only move one rows per key. Just make user the max aggregate is the right function in you case.
Upvotes: 0