Jonathan Porter
Jonathan Porter

Reputation: 1556

SSIS - Fast way of detecting true deletes and updating data warehouse

I'm looking for an efficient way of detecting deleted records in production and updating the data warehouse to reflect those deletes because the table is > 12M rows and contains transactional data used for accounting purposes.

Originally, everything was done in a stored procedure by somebody before me and I've been tasked with moving the process to SSIS.

Here is what my test pattern looks like so far:

enter image description here

Inside the Data Flow Task:

enter image description here

I'm using MD5 hashes to speed up the ETL process as demonstrated in this article.

This should give a huge speed boost to the process by not having to store so many rows in memory for comparison purposes and by removing the bulk of conditional split processing at the same time.

But the issue is it doesn't account for records that are deleted in production.

How should I go about doing this? It may be simple to you but I'm new to SSIS so I'm not sure how to ask correctly.

Thank you in advance.

Upvotes: 2

Views: 2257

Answers (2)

Jonathan Porter
Jonathan Porter

Reputation: 1556

The solution I ended up using was to add another Data Flow Task and use the Lookup transformation to find records that didn't exist in production when compared to our fact table. This task comes after all of the inserts and updates as shown in my question above.

Then we can batch delete missing records in an execute SQL task.

Inside Data Flow Task:

DFT1

Inside Lookup Transformation:

(note the Redirect rows to no match output)

DFT2

DFT3

Lookup

So, if the ID's don't match those rows will be redirected to the no match output which we set to go to our staging table. Then, we will join staging to the fact table and apply the deletions as shown below inside an execute SQL task.

EST

Upvotes: 3

john McTighe
john McTighe

Reputation: 1181

I think you'll need to adopt you dataflow to use a merge join instead of a lookup. That way you can see whats new/changed & deleted. You'll need to sort both Flows by the same joining key (in this case your hash column).

Personally i'm not sure I'd bother and Instead I'd simply stage all my prod data and then do a 3-way SQL merge statement to handle Inserts updates & deletes in one pass. You can keep your hash column as a joining key if you like.

Upvotes: 0

Related Questions