Reputation: 13
I have a pipeline that loads the last 7 days of data from Google Analytics 4 into a landing table. From there, I want to copy it into my data warehouse without getting any duplicates.
This is my source table:
CREATE TABLE [lnd_ga4].[dateScreenPageViews_Page](
[sk_id] [int] IDENTITY(1,1) NOT NULL,
[taxonomie_id] [bigint] NULL,
[dim_date] [varchar](512) NULL,
[dim_fullPageUrl] [varchar](512) NULL,
[dim_pagePath] [varchar](512) NULL,
[dim_articleId] [varchar](512) NULL,
[dim_articleType] [varchar](512) NULL,
[dim_pageReferrer] [varchar](512) NULL,
[dim_pageTitle] [varchar](512) NULL,
[dim_sessionSource] [varchar](512) NULL,
[dim_type] [char](2) NULL,
[engagedSessions] [bigint] NULL,
[screenPageViews] [bigint] NULL,
[last_updated] [datetime] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [lnd_ga4].[dateScreenPageViews_Page] ADD DEFAULT (getdate()) FOR [last_updated]
GO
This is the sink table:
CREATE TABLE [stg0_ga4].[dateScreenPageViews_Page](
[sk_id] [int] IDENTITY(1,1) NOT NULL,
[taxonomie_id] [bigint] NULL,
[dim_date] [varchar](512) NULL,
[dim_fullPageUrl] [varchar](512) NULL,
[dim_pagePath] [varchar](512) NULL,
[dim_articleId] [varchar](512) NULL,
[dim_articleType] [varchar](512) NULL,
[dim_pageReferrer] [varchar](512) NULL,
[dim_pageTitle] [varchar](512) NULL,
[dim_sessionSource] [varchar](512) NULL,
[dim_type] [char](2) NULL,
[engagedSessions] [bigint] NULL,
[screenPageViews] [bigint] NULL,
[last_updated] [datetime] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [stg0_ga4].[dateScreenPageViews_Page] ADD DEFAULT (getdate()) FOR [last_updated]
GO
This is the SQL query of the copy data activity:
SELECT
CAST(
dbo.GA4_getTaxonomieID(
CONCAT('www.xyz.com', dim_pagePath)
) AS bigint
) AS taxonomie_id,
TRIM(dim_date) AS dim_date,
TRIM(
CONCAT('www.xyz.com', dim_pagePath)
) AS dim_fullPageUrl,
TRIM(dim_pagePath) AS dim_pagePath,
TRIM(
dbo.GA4_getArticleID(
CONCAT('www.xyz.com', dim_pagePath)
)
) AS dim_articleId,
TRIM(
dbo.GA4_getArticleType(
CONCAT('www.xyz.com', dim_pagePath)
)
) AS dim_articleType,
TRIM(dim_pageReferrer) AS dim_pageReferrer,
TRIM(dim_pageTitle) AS dim_pageTitle,
TRIM(dim_sessionSource) AS dim_sessionSource,
TRIM(
dbo.GA4_getType(
CONCAT('www.xyz.com', dim_pagePath)
)
) AS dim_type,
engagedSessions,
screenPageViews
FROM
lnd_ga4.dateScreenPageViews_Page
ORDER BY
sk_id DESC
This is the source configuration of the Azure Data Factory copy data activity:
This is the sink side:
This is the expression for the key columns:
@json(replace(string(pipeline().parameters.keys),'\',''))
This is the array for the key columns:
["taxonomie_id","dim_date","dim_fullPageUrl","dim_pagePath","dim_articleId","dim_articleType","dim_pageReferrer","dim_pageTitle","dim_sessionSource","dim_type"]
This is the source data output of the SQL query above:
I have limited it to 10 records.
This is the content of the sink table after the first run:
Now, if I restart the pipeline without changing anything, I end up with twice as many rows as before:
I've tested the key columns on a much simpler data set, and there it works. I even imported the mapping schema and entered all the keys manually. The behaviour is the same.
I must be doing something wrong, but I can't figure out what.
Upvotes: 0
Views: 482
Reputation: 13
Key columns can't be NULL. In SQL, NULL never compares equal to NULL, so rows whose key columns are NULL never match the rows already in the sink and get inserted again on every run. I put a COALESCE() around every key column and now it works.
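A sketch of the fixed source query, with every key column wrapped in COALESCE(); the fallback values (-1 and '') are placeholders, so pick literals that cannot occur in your real data:

SELECT
    COALESCE(CAST(dbo.GA4_getTaxonomieID(CONCAT('www.xyz.com', dim_pagePath)) AS bigint), -1) AS taxonomie_id,
    COALESCE(TRIM(dim_date), '') AS dim_date,
    COALESCE(TRIM(CONCAT('www.xyz.com', dim_pagePath)), '') AS dim_fullPageUrl,
    COALESCE(TRIM(dim_pagePath), '') AS dim_pagePath,
    COALESCE(TRIM(dbo.GA4_getArticleID(CONCAT('www.xyz.com', dim_pagePath))), '') AS dim_articleId,
    COALESCE(TRIM(dbo.GA4_getArticleType(CONCAT('www.xyz.com', dim_pagePath))), '') AS dim_articleType,
    COALESCE(TRIM(dim_pageReferrer), '') AS dim_pageReferrer,
    COALESCE(TRIM(dim_pageTitle), '') AS dim_pageTitle,
    COALESCE(TRIM(dim_sessionSource), '') AS dim_sessionSource,
    COALESCE(TRIM(dbo.GA4_getType(CONCAT('www.xyz.com', dim_pagePath))), '') AS dim_type,
    engagedSessions,
    screenPageViews
FROM
    lnd_ga4.dateScreenPageViews_Page
ORDER BY
    sk_id DESC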
Upvotes: 0