Dirk Sachse
Dirk Sachse

Reputation: 13

Copy Data Task - Upsert produces duplicate entries

I have a pipeline, that loads the last 7 days of data from Google Analytics 4 into a landing table. From there, I want to copy it into my data warehouse and I dont want to get any duplicates.

This is my source table:

CREATE TABLE [lnd_ga4].[dateScreenPageViews_Page](
    [sk_id] [int] IDENTITY(1,1) NOT NULL,
    [taxonomie_id] [bigint] NULL,
    [dim_date] [varchar](512) NULL,
    [dim_fullPageUrl] [varchar](512) NULL,
    [dim_pagePath] [varchar](512) NULL,
    [dim_articleId] [varchar](512) NULL,
    [dim_articleType] [varchar](512) NULL,
    [dim_pageReferrer] [varchar](512) NULL,
    [dim_pageTitle] [varchar](512) NULL,
    [dim_sessionSource] [varchar](512) NULL,
    [dim_type] [char](2) NULL,
    [engagedSessions] [bigint] NULL,
    [screenPageViews] [bigint] NULL,
    [last_updated] [datetime] NOT NULL
) ON [PRIMARY]
GO

ALTER TABLE [lnd_ga4].[dateScreenPageViews_Page] ADD  DEFAULT (getdate()) FOR [last_updated]
GO

This is the sink table:

CREATE TABLE [stg0_ga4].[dateScreenPageViews_Page](
    [sk_id] [int] IDENTITY(1,1) NOT NULL,
    [taxonomie_id] [bigint] NULL,
    [dim_date] [varchar](512) NULL,
    [dim_fullPageUrl] [varchar](512) NULL,
    [dim_pagePath] [varchar](512) NULL,
    [dim_articleId] [varchar](512) NULL,
    [dim_articleType] [varchar](512) NULL,
    [dim_pageReferrer] [varchar](512) NULL,
    [dim_pageTitle] [varchar](512) NULL,
    [dim_sessionSource] [varchar](512) NULL,
    [dim_type] [char](2) NULL,
    [engagedSessions] [bigint] NULL,
    [screenPageViews] [bigint] NULL,
    [last_updated] [datetime] NOT NULL
) ON [PRIMARY]
GO

ALTER TABLE [stg0_ga4].[dateScreenPageViews_Page] ADD  DEFAULT (getdate()) FOR [last_updated]
GO

This is the SQL query of the copy data activity:

SELECT 
  CAST(
    dbo.GA4_getTaxonomieID(
      CONCAT('www.xyz.com', dim_pagePath)
    ) AS bigint
  ) AS taxonomie_id, 
  TRIM(dim_date) as dim_date, 
  TRIM(
    CONCAT('www.xyz.com', dim_pagePath)
  ) AS dim_fullPageUrl, 
  TRIM(dim_pagePath) as dim_pagePath, 
  TRIM(
    dbo.GA4_getArticleID(
      CONCAT('www.xyz.com', dim_pagePath)
    )
  ) AS dim_articleId, 
  TRIM(
    dbo.GA4_getArticleType(
      CONCAT('www.xyz.com', dim_pagePath)
    )
  ) AS dim_articleType, 
  TRIM(dim_pageReferrer) AS dim_pageReferrer, 
  TRIM(dim_pageTitle) AS dim_pageTitle, 
  TRIM(dim_sessionSource) AS dim_sessionSource, 
  TRIM(
    dbo.GA4_getType(
      CONCAT('www.xyz.com', dim_pagePath)
    )
  ) AS dim_type, 
  engagedSessions, 
  screenPageViews
FROM 
  lnd_ga4.dateScreenPageViews_Page 
ORDER BY 
  sk_id DESC

This the source configuration of the Azure Data Factroy copy data activity:

This is the sink side:

This is the expression for the key columns:

@json(replace(string(pipeline().parameters.keys),'\',''))

This is the array for the key columns:

["taxonomie_id","dim_date","dim_fullPageUrl","dim_pagePath","dim_articleId","dim_articleType","dim_pageReferrer","dim_pageTitle","dim_sessionSource","dim_type"]

This is the source data output of the SQL query above:

I have limited it to 10 records.

This is the content of the sink table after the first run:

Now, if I restart the pipeline, without changing anything, I end up with twice as many rows as before:

I ve tested the column keys on a much simpler data set and it is working. I even imported the mapping scheme and entered all keys manually. The behaviour is the same.

I must be doing something wrong, but I cant figure it out.

Upvotes: 0

Views: 482

Answers (1)

Dirk Sachse
Dirk Sachse

Reputation: 13

Key columns cant be NULL. I put a COALESCE() arround every key column and now it works

Upvotes: 0

Related Questions