Reputation: 8382
I have built a Data Factory pipeline that ETLs data from a Data Lake into a data warehouse. I chose SCD type 1 for my dimensions.
My pipeline contains the following activities:
The problem with this pipeline is that I end up with duplicated fact entries in my warehouse if I run the pipeline twice.
Question
How can I efficiently prevent duplicated rows in my facts table, considering all the unsupported features in Azure SQL Data Warehouse?
Update
I have read another piece of information regarding the indexing (and the statistics) of a warehouse and how it must be rebuilt after an update.
Considering that, the simplest thing I thought of was to apply the same principle to the facts as the one I am using for the dimensions: load all the new facts into a staging table, then use an index on the fact table to insert only the facts that do not already exist (the facts can't be updated right now).
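Concretely, that staging-plus-anti-join pattern can be sketched as below. The table and column names (`fact_stage`, `fact`, `bk`) are hypothetical placeholders, and `NOT EXISTS` is used because `MERGE` is among the unsupported features in Azure SQL Data Warehouse:

```sql
-- Hypothetical names: fact_stage = newly loaded facts, fact = target table,
-- bk = business key. Insert only rows whose key is not already in the fact table.
insert into fact (bk, col1, col2)
select s.bk, s.col1, s.col2
from fact_stage s
where not exists (
    select 1
    from fact f
    where f.bk = s.bk
);
```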
Upvotes: 3
Views: 1672
Reputation: 3078
Do the heavy lifting in Azure SQL Data Warehouse ... your performance will improve dramatically, and your problem will go away.
How many rows are in your filtered files? If it is in the millions to tens of millions, I think you can probably avoid the filter at the data lake stage. The performance of Polybase + SQL should overcome the additional data volume.
If you can avoid the filter, use this logic and throw away the U-SQL processing:
BK = Business Key column/s. COLn = non-key columns
-- Get the latest row for each business key to eliminate duplicates.
-- Note: a window-function alias can't be referenced in the WHERE clause of
-- the same query level, so the ROW_NUMBER filter goes in an outer select.
create table stage2 with (heap, distribution = hash(bk)) as
select bk, col1, col2
from (
    select bk,
           col1,
           col2,
           row_number() over (partition by bk order by timestamp desc) as rownum
    from stage
) x
where rownum = 1;
-- Merge the stage into a copy of the dimension
create table dimension_copy with (heap,distribution=replicate) as
select s.bk,
s.col1,
s.col2
from stage2 s
where not exists (
select 1
from dimension d
where d.bk = s.bk)
union all -- the two branches are disjoint by construction, so skip the distinct sort
select d.bk,
case when s.bk is null then d.col1 else s.col1 end,
case when s.bk is null then d.col2 else s.col2 end
from dimension d
left outer join stage2 s on s.bk = d.bk;
-- Switch the merged copy into the original (both tables must have matching
-- schemas and the same distribution type)
alter table dimension_copy switch to dimension with (truncate_target = on);
-- Force distribution of replicated table across nodes
select top 1 * from dimension;
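Since your update mentions that statistics must be refreshed after this kind of rewrite, it is also worth updating statistics on the switched-in table so the optimizer sees the new row counts. A minimal example (this updates all statistics objects on the table):

```sql
-- Refresh optimizer statistics on the rebuilt dimension.
update statistics dimension;
```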
Upvotes: 2