Reputation: 636
I have a RDS database whose snapshot is taken everyday and is kept in a S3 bucket. I copy the RDS snapshot data from S3 to Amazon Redshift database daily. I can use copy
to copy the tables but instead of copying the whole table, I want to copy only the rows which were added since the last snapshot was taken(Incremental copying).
For example, in RDS, there is a table name "user" which looks like this at 25-05-2021
id | username
1 | john
2 | cathy
When I will run the data loader for first time on 26-05-2021, it will copy these two rows into the Redshift table with the same name.
Now on 26-05-2021, the table in RDS looks like this:
id | username
1 | john
2 | cathy
3 | ola
4 | mike
When I will run the data loader on 27-05-2021, instead of copying all three rows, I want to copy/take only the rows which has been newly added(id = 3 and id = 4
) as I already have the other rows.
What should be the best way of doing this incremental loading?
Upvotes: 2
Views: 2560
Reputation: 270224
The COPY command will always load the entire table. However, you could create an External Table using Redshift Spectrum that accesses the files without loading them into Redshift. Then, you could construct a query that does an INSERT where the ID is greater than the last ID used in the Redshift table.
Perhaps I should explain it a bit simpler...
existing_table
in Redshift already has rows up to id = 2
CREATE EXTERNAL TABLE in_data
to point at the files in S3 containing the dataINSERT INTO existing_table SELECT * FROM in_data WHERE id > (SELECT MAX(id) FROM existing_table
In theory, this should only load the new rows into the table.
Upvotes: 1