Michael
Michael

Reputation: 2657

SSIS only extract Delta changes

After some advice. I'm using SSIS\SQL Server 2014. I have a nightly SSIS package that pulls in data from non-SQL Server db's into a single table (the SQL table is truncated beforehand each time) and I then extract from this table to create a daily csv file.

Going forward, I only want to extract to csv on a daily basis the records that have changed i.e. the Deltas.

What is the best approach? I was thinking of using CDC in SSIS, but as I'm truncating the SQL table before the initial load each time, will this be best method? Or will I need to have a master table in SQL with an initial load, then import into another table and just extract where there are different? For info, the table in SQL contains a Primary Key.

I just want to double check as CDC assumes the tables are all in SQL Server, whereas my data is coming from outside SQL Server first.

Thanks for any help.

Upvotes: 2

Views: 2172

Answers (1)

Eric Brandt
Eric Brandt

Reputation: 8101

The primary key on that table is your saving grace here. Obviously enough, the SQL Server database that you're pulling the disparate data into won't know from one table flush to the next which records have changed, but if you add two additional tables, and modify the existing table with an additional column, it should be able to figure it out by leveraging HASHBYTES.

For this example, I'll call the new table SentRows, but you can use a more meaningful name in practice. We'll call the new column in the old table HashValue.

Add the column HashValue to your table as a varbinary data type. NOT NULL as well.

Create your SentRows table with columns for all the columns in the main table's primary key, plus the HashValue column.

Create a RowsToSend table that's structurally identical to your main table, including the HashValue.

Modify your queries to create the HashValue by applying HASHBYTES to all of the non-key columns in the table. (This will be horribly tedious. Sorry about that.)

Send out your full data set.

Now move all of the key values and HashValues to the SentRows table. Truncate your main table.

On the next pull, compare the key values and HashValues from SentRows to the new data in the main table.

Primary key match + hash match = Unchanged row

Primary key match + hash mismatch = Updated row

Primary key in incoming data but missing from existing data set = New row

Primary key not in incoming data but in existing data set = Deleted row

Pull out any changes you need to send to the RowsToSend table.

Send the changes from RowsToSend.

Move the key values and HashValues to your SentRows table. Update hashes for changed key values, insert new rows, and decide how you're going to handle deletes, if you have to deal with deletes.

Truncate the SentRows table to get ready for tomorrow.

If you'd like (and you'll thank yourself later if you do) add a computed column to the SentRows table with default of GETDATE(), which will tell you when the row was added.

And away you go. Nothing but deltas from now on.

Edit 2019-10-31:

Step by step (or TL;DR):

1) Flush and Fill MainTable.

2) Compare keys and hashes on MainTable to keys and hashes on SentRows to identify new/changed rows.

3) Move new/changed rows to RowsToSend.

4) Send the rows that are in RowsToSend.

5) Move all the rows from RowsToSend to SentRows.

6) Truncate RowsToSend.

Upvotes: 3

Related Questions