There is a lot of (older) related data spread over many tables in our database that needs archiving to a separate database then deleting from the original database (probably daily). What is archived (and deleted) is determined by business rules. We don't want newer data in this archive (it changes and is accessed frequently). Some of the archived data may need transferring back if required and possibly reported on. I need to come up with an efficient and easy to maintain solution in SSIS. I can think of at least three ways of doing this: write a big select statement with lots of left joins to get all the data that I want, then go through this data either in memory or store it in a staging table work on a table by table basis, selecting the jobs, then the related data from other tables similar to the first method, except just dump the data into a de-normalised table (seems like a lazy method?) I considered staging tables, though I don't see a benefit in this case - I can copy the data directly to the archive table with a timestamp (timestamp = process started) then when finished, go and delete the id's that were timestamped at the time the process started. Once I've identified the job id, I know the rest of the data that needs to be archived too. I'm looking for the optimal approach, does anyone have another approach? How would other people achieve this?

Sounds like you need two processes, one to archive old data, and one to reload archived data. I’d tackle both with the following methodology. For Archiving data: First, go through and, using your business rules, identify the data that needs to be archived. By and large, this means building a temporary table (or data set) containing only the keys of the data to be archived. Using invoices as an example, I’d collect just the IDs of the Invoices to be archived, and that should be sufficient to identify all the related child data (invoice line items, shipping and payment information, etc. etc.) You might need multiple such sets, such as for archiving both invoices, customers, and sales people, since none are “fully contained” within another. Be certain these sets are whole and complete (that is, they contain no broken parent/child relationships). Basing all subsequent work on these data sets will ensure that nothing extra inadvertently “slips in”. Next, go through and copy the data identified within these sets from the source database to the archive database. When all the data has been properly copied over (and only then), go back and delete this data from the source tables. For Reloading data, it’d be pretty much the same process, but working from the archive set to the “live” set. Special considerations would include: If reloaded, can the data be modified? If so, then you’d presumably need to delete it from the archive since, when it eventually is re-archived, it may have been modified. That, or you’ll have to allow for archiving the same set of data twice! You’d presumably want to flag reloaded data somehow, so that it does not get immediately re-archived by the next archive run. This methodology would work regardless of tool—SSIS or otherwise.

databasessisarchivedata-warehouse

Mr Shoubs

Reputation: 15419

How to archive related data (in different tables) using SSIS

There is a lot of (older) related data spread over many tables in our database that needs archiving to a separate database then deleting from the original database (probably daily). What is archived (and deleted) is determined by business rules. We don't want newer data in this archive (it changes and is accessed frequently). Some of the archived data may need transferring back if required and possibly reported on.

I need to come up with an efficient and easy to maintain solution in SSIS. I can think of at least three ways of doing this:

write a big select statement with lots of left joins to get all the data that I want, then go through this data either in memory or store it in a staging table
work on a table by table basis, selecting the jobs, then the related data from other tables
similar to the first method, except just dump the data into a de-normalised table (seems like a lazy method?)

I considered staging tables, though I don't see a benefit in this case - I can copy the data directly to the archive table with a timestamp (timestamp = process started) then when finished, go and delete the id's that were timestamped at the time the process started. Once I've identified the job id, I know the rest of the data that needs to be archived too.

I'm looking for the optimal approach, does anyone have another approach? How would other people achieve this?

Upvotes: 2

Answers (3)

Aram

Reputation: 716

Use CQRS. The problem is in the term "related data". If you segregate your reads (all possible reads in own database/tables) you will not have a need for related data and can apply rules to each "Aggregate" separately if necessary. Let's say for list views you an get count from two different sources (actual and archived data) when you are rendering pager. For detail view you still can have UUID's for each resource but in this case application will read from different data stores. The key point here is you get rid of all kind of joins.

Upvotes: 0

Philip Kelley

Reputation: 40359

Sounds like you need two processes, one to archive old data, and one to reload archived data. I’d tackle both with the following methodology.

For Archiving data:

First, go through and, using your business rules, identify the data that needs to be archived. By and large, this means building a temporary table (or data set) containing only the keys of the data to be archived. Using invoices as an example, I’d collect just the IDs of the Invoices to be archived, and that should be sufficient to identify all the related child data (invoice line items, shipping and payment information, etc. etc.) You might need multiple such sets, such as for archiving both invoices, customers, and sales people, since none are “fully contained” within another.
Be certain these sets are whole and complete (that is, they contain no broken parent/child relationships). Basing all subsequent work on these data sets will ensure that nothing extra inadvertently “slips in”.
Next, go through and copy the data identified within these sets from the source database to the archive database.
When all the data has been properly copied over (and only then), go back and delete this data from the source tables.

For Reloading data, it’d be pretty much the same process, but working from the archive set to the “live” set. Special considerations would include:

If reloaded, can the data be modified? If so, then you’d presumably need to delete it from the archive since, when it eventually is re-archived, it may have been modified. That, or you’ll have to allow for archiving the same set of data twice!
You’d presumably want to flag reloaded data somehow, so that it does not get immediately re-archived by the next archive run.

This methodology would work regardless of tool—SSIS or otherwise.

Upvotes: 6

sabyasm

Reputation: 56

Could you please be more specific?

Do you need the 'old' data still present in your current database? If yes - then you can simply:

Backup and restore
Write one off 'efficient' delete
Create new SSIS package that will maintain future data population/maintain

If you don't need the 'old' data in the existing database, inserting the required archive data to new database will be the way to go. Depending what you need in the archive database - if you don't need the non-key you can go for normalization otherwise table by table approach will be good.

if this helps please mark as answer

Upvotes: 0

How to archive related data (in different tables) using SSIS

Answers (3)

Related Questions