Reputation: 185
I have a situation where I am using data pipeline to import data from csv file stored in S3. For initial data load, data pipeline is executing good.
Now I need to keep this database up-to-date and synced to the in-premise DB. Which mean there will be set of CSV file coming to S3 which would be the updates to some existing records, new records or deletion. I need that to be updated on RDS through data pipeline.
Question - Can data pipeline is designed for such purpose OR is just meant for one-off data load? If it can be used for incremental updates, then how do I go about it.
Any help is much appreciated!
Upvotes: 2
Views: 4768
Reputation: 46
Yes, you need to do an update and insert (aka upsert).
If you have a table with keys: key_a, key_b and other columns: col_c, col_d you can use the following SQL:
insert into TABLENAME (key_a, key_b, col_c, col_d) values (?,?,?,?) ON DUPLICATE KEY UPDATE col_c=values(col_c), col_d=values(col_d)
Upvotes: 3
Reputation: 46
Kindly refer to the aws documentation: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-incrementalcopyrdstos3.html
There is a predefined template for Mysql RDS incremental upload, I personally have tried incremental uploads from mysql, sql server and redshift. You can start with using the mysql template and edit it in architect view to get a insight of the new/additional fiels it uses and likewise create datapipeline for other RDS database as well.
Internally the Incremental requires you to provide the change column which needs to be essentially a date column, and it this changecolumn is them used in the Sql script which is like:
select * from #{table} where #{myRDSTableLastModifiedCol} >= '#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' and #{myRDSTableLastModifiedCol} <= '#{format(@scheduledEndTime, 'YYYY-MM-dd HH-mm-ss')}'
scheduledStartTime and scheduleEndTime are the datapipeline expression whose value depends upon your schedule. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-pipeline-expressions.html
and scheduletype is timeseries to execute the sql at the end of the schedule end time to guarrante that there is no data loss.
And yes deleted data cant be tracked in through datapipeline; also datapipleline would also not help if the datetime column is not there in your table, in which case i wlould prefer loading full table.
I hope i have covered pretty-much i know :)
Regards, Varun R
Upvotes: 2