Handling Duplicates in Data Warehouse

Question

I was going through the below link for handling Data Quality issues in a data warehouse.

http://www.kimballgroup.com/2007/10/an-architecture-for-data-quality/

" Responding to Quality Events I have already remarked that each quality screen has to decide what happens when an error is thrown. The choices are: 1) halting the process, 2) sending the offending record(s) to a suspense file for later processing, and 3) merely tagging the data and passing it through to the next step in the pipeline. The third choice is by far the best choice. "

In some dimensional feeds (like Client list), sometimes we get a same Client twice (the two records having difference in certain attributes). What is the best solution in this scenario?

I don't want to reject both records (as that would mean incomplete client data).
The source systems are very slow in fixing the issue, so we get the same issues every day. That means a manual fix to the problem also is tough as it has to be done every day (we receive the client list everyday).
Selecting a single record is not possible as we don't know what the correct value is.
Having both the records in our warehouse means our joins are disrupted. Because of two rows for the same ID, the fact table rows are doubled (in a join).

Any thoughts?

Handling Duplicates in Data Warehouse

Answers (1)

Related Questions