SimonB
SimonB

Reputation: 1055

AWS DMS CDC to S3 target

So I was playing around seeing what could be achieved using Data Migration Service Chance Data Capture, to take data from MSSQL to S3 & also Redshift.

The redshift testing was fine, if I delete a record in my source DB, a second or two later the record disappears from Redshift. Same with Insert/update etc ..

But S3 ... You get the original record from the first full load. Then if you update a record in source, S3 receives a new copy of the record, marked with an 'I'.
If I delete a record, I get another copy of the record marked with a 'D'.

So my question is - what do I do with all this ? How would I query my S3 bucket to see the 'current' state of my data set as reflecting the source database ?

Do I have to script some code myself to pick up all these files and process them, performing the inserts/updates and deletes until I finally resolve back to a 'normal' data set ?

Any insight welcomed !

Upvotes: 4

Views: 4059

Answers (2)

Mradul Yd
Mradul Yd

Reputation: 131

As explained here about what 'I', 'U' and 'D' means.

What we do to get the current state of the db? An alternate is to first of all add this additional column to the fullload files as well, i.e. the initial loaded files before CDC should also have this additional column. How? Now query the data in athena in such a way where we exclude the records where Op not in ("D", "U") or AR_H_OPERATION NOT IN ("DELETE", "UPDATE"). Thus you get the correct count (ONLY COUNT as 'U' would only come if there is already an I for that entry).

SELECT count(*) FROM "database"."table_name"
WHERE Op NOT IN ('D','U')

Also to get all the records you may try something in athena with a complex sql, where Op not in ('D') and records when Op IN = 'I' and count 1 or else if count 2, pick the latest one or Op = 'U'.

Upvotes: 0

user253684
user253684

Reputation: 71

The records containing 'I', 'D' or 'U' are actually CDC data (change data capture). This is called sometimes called "history" or "historical data". This type of data has some applications in data warehousing and also this can also be used in many Machine learning uses cases.

Now coming to the next point, in order to get 'current' state of data set you have to script/code yourself. You can use AWS Glue to perform the task. For-example, This post explains something similar.

If you do not want to maintain the glue code, then a shortcut is not to use s3 target with DMS directly, but use Redshift target and once all CDC is applied offload the final copy to S3 using Redshift unload command.

Upvotes: 4

Related Questions