Kzryzstof

Reputation: 8382

How can I efficiently prevent duplicated rows in my facts table?

I have built a Data Factory pipeline that ETLs the data from a Data Lake into a data warehouse. I chose SCD type 1 for my dimensions.

My pipeline contains the following activities:

  1. [Stored Procedure] Clear staging tables;
  2. [Stored Procedure] Get the timestamp of the last successful update (see the watermark sketch after this list);
  3. [U-SQL] Extract the dimension data from the filtered files (the ones that have been modified since the last successful update) in Azure Data Lake, transform it and output it in a csv file;
  4. [Copy Data] Load the csv into a SQL datawarehouse staging dimension table;
  5. [Stored Procedure] Merge the data from the staging table into the production table;
  6. [U-SQL] Extract the fact data from the files (the ones that have been modified since the last successful update) in Azure Data Lake, transform it and output it in a csv file;
  7. [Copy Data] Load the csv into a SQL datawarehouse fact table;
  8. [Stored Procedure] Update the timestamp of the successful update.
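
Steps 2 and 8 together form a high-water-mark pattern. As a rough sketch only (the watermark table and procedure names below are hypothetical, not objects from the actual pipeline), the two stored procedures could look like this:

-- Hypothetical watermark table holding the last successful load time per pipeline.
create table dbo.etl_watermark
(
    pipeline_name nvarchar(100) not null,
    last_success  datetime2     not null
);

-- Step 2: read the timestamp of the last successful update.
create procedure dbo.get_last_success @pipeline_name nvarchar(100)
as
begin
    select last_success
    from   dbo.etl_watermark
    where  pipeline_name = @pipeline_name;
end;

-- Step 8: move the watermark forward once the run has completed successfully.
create procedure dbo.set_last_success @pipeline_name nvarchar(100), @run_start datetime2
as
begin
    update dbo.etl_watermark
    set    last_success = @run_start
    where  pipeline_name = @pipeline_name;
end;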

The problem with this pipeline is that I end up with duplicated fact entries in my warehouse if I run the pipeline twice.

Question

How can I efficiently prevent duplicated rows in my facts table, considering all the unsupported features in Azure SQL Data Warehouse?

Update

I have also read about the indexing (and the statistics) of a warehouse, and how they must be rebuilt after an update.
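
For reference, that post-load maintenance usually boils down to something like the following (the fact table name is hypothetical):

-- Rebuild the columnstore index to compact the freshly loaded rows.
alter index all on dbo.FactSales rebuild;

-- Refresh all existing statistics on the table after the load.
update statistics dbo.FactSales;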

With that in mind, the simplest thing I can think of is to apply the same principle to the facts as the one I am using for the dimensions: load all the new facts into a staging table, and then use an index on the fact table to include only the facts that do not already exist (the facts cannot be updated right now).
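
A minimal sketch of that staged insert, with hypothetical table and column names (fact_bk standing in for the business key):

-- Insert only the staged facts whose business key is not already in the fact table.
insert into dbo.FactSales (fact_bk, col1, col2)
select  s.fact_bk,
        s.col1,
        s.col2
from    stage.FactSales as s
where   not exists (
            select 1
            from   dbo.FactSales as f
            where  f.fact_bk = s.fact_bk
        );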

Upvotes: 3

Views: 1672

Answers (1)

Ron Dunn

Reputation: 3078

Do the heavy lifting in Azure SQL Data Warehouse ... your performance will improve dramatically, and your problem will go away.

How many rows are in your filtered files? If it is in the millions to tens of millions, I think you can probably avoid the filter at the data lake stage. The performance of Polybase + SQL should overcome the additional data volume.
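
As a rough illustration of the PolyBase route, the "ingest files to staging table" step in the list below could look like this; every object here (data source, credential, file format, external table, paths and columns) is a hypothetical placeholder, not something taken from the question:

-- Assumes a database scoped credential for the lake (LakeCredential) already exists.
create external data source LakeSource
with (
    type = hadoop,
    location = 'adl://yourlake.azuredatalakestore.net',
    credential = LakeCredential
);

create external file format CsvFormat
with (
    format_type = delimitedtext,
    format_options (field_terminator = ',')
);

-- External table over the raw fact files; columns are placeholders.
create external table dbo.FactSource_ext
(
    bk          int,
    col1        decimal(18,2),
    col2        decimal(18,2),
    [timestamp] datetime2
)
with (
    location = '/facts/',
    data_source = LakeSource,
    file_format = CsvFormat
);

-- CTAS into a hash-distributed staging table; PolyBase loads the files in parallel.
create table stage
with (heap, distribution = hash(bk))
as
select * from dbo.FactSource_ext;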

If you can avoid the filter, use this logic and throw away the U-SQL processing:

  • Ingest files to staging table with suitable hash distribution
  • Take the latest version of each row (suitable for SCD1)
  • Merge stage to fact using a query like this:

BK = Business Key column/s. COLn = non-key columns

-- Get the latest row for each business key to eliminate duplicates.
-- (A row_number alias can't be filtered in the same query's WHERE clause, so wrap it in a derived table.)

create table stage2 with (heap, distribution = hash(bk)) as
select  bk,
        col1,
        col2
from    (
            select  bk,
                    col1,
                    col2,
                    row_number() over (partition by bk order by timestamp desc) as rownum
            from    stage
        ) latest
where   rownum = 1;

-- Merge the stage into a copy of the dimension

create table dimension_copy with (heap, distribution = replicate) as
select    s.bk,
          s.col1,
          s.col2
from      stage2 s
where     not exists (
              select  1
              from    dimension d
              where   d.bk = s.bk)

union

select   d.bk,
         case when s.bk is null then d.col1 else s.col1 end,
         case when s.bk is null then d.col2 else s.col2 end
from     dimension d
         left outer join stage2 s on s.bk = d.bk;

-- Switch the merged copy with the original 

alter table dimension_copy switch to dimension with (truncate_target=on);

-- Force distribution of replicated table across nodes

select top 1 * from dimension;

Upvotes: 2
