Kzryzstof

Reputation: 8382

How can I efficiently prevent duplicated rows in my facts table?

I have built a Data Factory pipeline that ETLs the data from a Data Lake into a data warehouse. I chose SCD type 1 for my dimensions.

My pipeline contains the following activities:

  1. [Stored Procedure] Clear staging tables;
  2. [Stored Procedure] Get the timestamp of the last successful update (see the watermark sketch after this list);
  3. [U-SQL] Extract the dimension data from the filtered files (the ones that have been modified since the last successful update) in Azure Data Lake, transform it and output it in a csv file;
  4. [Copy Data] Load the csv into a SQL datawarehouse staging dimension table;
  5. [Stored Procedure] Merge the data from the staging table into the production table;
  6. [U-SQL] Extract the fact data from the files (the ones that have been modified since the last successful update) in Azure Data Lake, transform it and output it in a csv file;
  7. [Copy Data] Load the csv into a SQL datawarehouse fact table;
  8. [Stored Procedure] Update the timestamp of the successful update.
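
Steps 2 and 8 together form a high-water-mark pattern. As a rough sketch only (the watermark table and procedure names below are hypothetical, not objects from the actual pipeline), the two stored procedures could look like this:

-- Hypothetical watermark table holding the last successful load time per pipeline.
create table dbo.etl_watermark
(
    pipeline_name nvarchar(100) not null,
    last_success  datetime2     not null
);

-- Step 2: read the timestamp of the last successful update.
create procedure dbo.get_last_success @pipeline_name nvarchar(100)
as
begin
    select last_success
    from   dbo.etl_watermark
    where  pipeline_name = @pipeline_name;
end;

-- Step 8: move the watermark forward once the run has completed successfully.
create procedure dbo.set_last_success @pipeline_name nvarchar(100), @run_start datetime2
as
begin
    update dbo.etl_watermark
    set    last_success = @run_start
    where  pipeline_name = @pipeline_name;
end;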

The problem with this pipeline is that I end up with duplicated fact entries in my warehouse if I run the pipeline twice.

Question

How can I efficiently prevent duplicated rows in my facts table, considering all the unsupported features in Azure SQL Data Warehouse?

Update

I have also read about the indexing (and the statistics) of a warehouse, and how they must be rebuilt after an update.
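
For reference, that post-load maintenance usually boils down to something like the following (the fact table name is hypothetical):

-- Rebuild the columnstore index to compact the freshly loaded rows.
alter index all on dbo.FactSales rebuild;

-- Refresh all existing statistics on the table after the load.
update statistics dbo.FactSales;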

With that in mind, the simplest thing I can think of is to apply the same principle to the facts as the one I am using for the dimensions: load all the new facts into a staging table, and then use an index on the fact table to include only the facts that do not already exist (the facts cannot be updated right now).
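
A minimal sketch of that staged insert, with hypothetical table and column names (fact_bk standing in for the business key):

-- Insert only the staged facts whose business key is not already in the fact table.
insert into dbo.FactSales (fact_bk, col1, col2)
select  s.fact_bk,
        s.col1,
        s.col2
from    stage.FactSales as s
where   not exists (
            select 1
            from   dbo.FactSales as f
            where  f.fact_bk = s.fact_bk
        );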

Upvotes: 3

Views: 1672

Answers (1)

Ron Dunn

Reputation: 3078

Do the heavy lifting in Azure SQL Data Warehouse ... your performance will improve dramatically, and your problem will go away.

How many rows are in your filtered files? If it is in the millions to tens of millions, I think you can probably avoid the filter at the data lake stage. The performance of Polybase + SQL should overcome the additional data volume.
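
As a rough illustration of the PolyBase route, the "ingest files to staging table" step in the list below could look like this; every object here (data source, credential, file format, external table, paths and columns) is a hypothetical placeholder, not something taken from the question:

-- Assumes a database scoped credential for the lake (LakeCredential) already exists.
create external data source LakeSource
with (
    type = hadoop,
    location = 'adl://yourlake.azuredatalakestore.net',
    credential = LakeCredential
);

create external file format CsvFormat
with (
    format_type = delimitedtext,
    format_options (field_terminator = ',')
);

-- External table over the raw fact files; columns are placeholders.
create external table dbo.FactSource_ext
(
    bk          int,
    col1        decimal(18,2),
    col2        decimal(18,2),
    [timestamp] datetime2
)
with (
    location = '/facts/',
    data_source = LakeSource,
    file_format = CsvFormat
);

-- CTAS into a hash-distributed staging table; PolyBase loads the files in parallel.
create table stage
with (heap, distribution = hash(bk))
as
select * from dbo.FactSource_ext;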

If you can avoid the filter, use this logic and throw away the U-SQL processing:

  • Ingest files to staging table with suitable hash distribution
  • Take the latest version of each row (suitable for SCD1)
  • Merge stage to fact using a query like this:

BK = Business Key column/s. COLn = non-key columns

-- Get the latest row for each business key to eliminate duplicates.
-- (A row_number alias can't be filtered in the same query's WHERE clause, so wrap it in a derived table.)

create table stage2 with (heap, distribution = hash(bk)) as
select  bk,
        col1,
        col2
from    (
            select  bk,
                    col1,
                    col2,
                    row_number() over (partition by bk order by timestamp desc) as rownum
            from    stage
        ) latest
where   rownum = 1;

-- Merge the stage into a copy of the dimension

create table dimension_copy with (heap, distribution = replicate) as
select    s.bk,
          s.col1,
          s.col2
from      stage2 s
where     not exists (
              select  1
              from    dimension d
              where   d.bk = s.bk)

union

select   d.bk,
         case when s.bk is null then d.col1 else s.col1 end,
         case when s.bk is null then d.col2 else s.col2 end
from     dimension d
         left outer join stage2 s on s.bk = d.bk;

-- Switch the merged copy with the original 

alter table dimension_copy switch to dimension with (truncate_target=on);

-- Force distribution of replicated table across nodes

select top 1 * from dimension;

Upvotes: 2
