Dirk Sachse

Reputation: 13

Azure Data Factory - Copy Data Task - Upsert is not updating

I am using a Copy Data activity in Azure Data Factory. Both the source dataset and the sink dataset are tables on an Azure SQL server. The source data looks like this:

Source data

The sink data table has the same structure as the source, but I only want 1 record per taxonomie_id and dim_date. The value screenPageViews should be updated.

This is the sink data I get after the first run of the Copy Data Activity, with Upsert activated and taxonomie_id and dim_date columns as keys:

Sink after 1st run

I thought I would end up with just one record per key combination.

Something like this:

79540;20230316;1

79560;20230316;2

79530;20230316;3

It gets even worse if I run the pipeline a second time. I then get this:

Sink data after second run

I now have twice as many records as before and at this point, the data is useless and wrong.

This is the source tab in the Copy Data Activity:

Source configuration of copy data activity

This is the query:

SELECT 
taxonomie_id,
dim_date,
screenPageViews 
FROM stg0_ga4.dateScreenPageViews_Page 
ORDER BY 2 DESC,1 DESC;
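For reference, a quick duplicate check over the two key columns (same table and column names as the query above) would show how many source rows share each (taxonomie_id, dim_date) key — any key with more than one row is a key the upsert will write more than once:

```sql
-- Count source rows per key; any count > 1 means duplicates
-- that the copy activity's upsert will carry into the sink.
SELECT taxonomie_id,
       dim_date,
       COUNT(*) AS rows_per_key
FROM stg0_ga4.dateScreenPageViews_Page
GROUP BY taxonomie_id, dim_date
HAVING COUNT(*) > 1
ORDER BY rows_per_key DESC;
```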

This is the sink configuration:

Sink configuration of copy data activity

What am I doing wrong? Do I misunderstand the concept of the upsert functionality?

Upvotes: 0

Views: 1799

Answers (2)

Pratik Lad

Reputation: 8382

What am I doing wrong? Do I misunderstand the concept of the upsert functionality?

How the upsert option in the copy activity works:

  • When a key-column value is missing from the sink, upsert inserts the row; when the key already exists, it updates the other columns of the matching row.
  • It updates matched rows unconditionally, without checking whether the data actually changed. Unlike SQL's MERGE, there is no way to tell the copy activity to skip rows that already exist identically in the sink.
  • So even when the key columns match, the values of the other columns are overwritten.
  • Upsert compares every source row against the sink on the key columns, and based on the match it inserts or updates the record in the sink.
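Conceptually, what upsert does for each batch of source rows is roughly a MERGE on the key columns — a sketch, where dbo.sinkTable and dbo.interimStaging are hypothetical names (ADF first lands the source rows in an interim table):

```sql
-- Rough SQL equivalent of the copy activity's upsert:
-- match on the key columns, update when found, insert when not.
-- Table names are illustrative, not what ADF actually generates.
MERGE INTO dbo.sinkTable AS sink
USING dbo.interimStaging AS src
ON  sink.taxonomie_id = src.taxonomie_id
AND sink.dim_date     = src.dim_date
WHEN MATCHED THEN
    UPDATE SET sink.screenPageViews = src.screenPageViews
WHEN NOT MATCHED THEN
    INSERT (taxonomie_id, dim_date, screenPageViews)
    VALUES (src.taxonomie_id, src.dim_date, src.screenPageViews);
```

Note one difference: a single MERGE would reject duplicate keys in the source, while the copy activity processes rows batch by batch and does not, which is presumably why duplicate source keys end up as duplicate sink rows.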

I only want 1 record per taxonomie_id and dim_date.

Agreed with Chen Hirsh: since your expected output has only one record per taxonomie_id and dim_date, you have to group the source rows by those two columns and reduce screenPageViews with an aggregate function, for example MAX(screenPageViews).

With GROUP BY:

SELECT taxonomie_id, dim_date, screenPageViews
FROM [dbo].[sql1demo]
GROUP BY taxonomie_id, dim_date, screenPageViews
ORDER BY 2 DESC, 1 DESC;

This gives you one record per distinct combination of the three columns. Note that this is only truly one record per taxonomie_id and dim_date when screenPageViews does not vary within a key; otherwise you need an aggregate such as MAX(screenPageViews).

If you select screenPageViews without grouping by it or aggregating it, SQL Server throws an error:

(screenshot of the resulting SQL error)

I don't want the max value, I want the data to be updated in that order.

  • If the sink also contains multiple rows per key, upsert updates all of them; it will not collapse them into a single row.
  • If the sink is empty, upsert inserts every source row that does not match on the key columns — duplicates included.
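If MAX is not the right reduction and you instead want the last row per key in some order, a ROW_NUMBER() window can keep exactly one row per key — a sketch using the question's table, where the ORDER BY inside the window is only a placeholder for whatever column actually defines "that order":

```sql
-- Keep exactly one row per (taxonomie_id, dim_date).
-- The ORDER BY inside ROW_NUMBER decides WHICH row survives;
-- screenPageViews DESC is a placeholder -- replace it with the
-- column that defines the desired order (e.g. a load timestamp).
SELECT taxonomie_id, dim_date, screenPageViews
FROM (
    SELECT taxonomie_id,
           dim_date,
           screenPageViews,
           ROW_NUMBER() OVER (
               PARTITION BY taxonomie_id, dim_date
               ORDER BY screenPageViews DESC) AS rn
    FROM stg0_ga4.dateScreenPageViews_Page
) AS ranked
WHERE rn = 1;
```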

Upvotes: 0

Chen Hirsh

Reputation: 1400

The upsert moves all rows from source to sink, and since you have multiple rows per key in your source, it would move all the rows to sink. Try an aggregation function on your source query:

SELECT 
taxonomie_id,
dim_date,
max(screenPageViews) as screenPageViews
FROM stg0_ga4.dateScreenPageViews_Page 
GROUP BY taxonomie_id,dim_date

This moves only one row per key. Just make sure that MAX is the right aggregate function in your case.

Upvotes: 0
