Rafiul Sabbir

Reputation: 636

Daily incremental copying of Amazon S3 data into Amazon Redshift

I have an RDS database whose snapshot is taken every day and kept in an S3 bucket. I copy the RDS snapshot data from S3 into an Amazon Redshift database daily. I can use COPY to copy the tables, but instead of copying each whole table, I want to copy only the rows that were added since the last snapshot was taken (incremental copying).

For example, in RDS, there is a table named "user" which looks like this on 25-05-2021:

id | username
1  | john
2  | cathy

When I run the data loader for the first time on 26-05-2021, it will copy these two rows into the Redshift table with the same name.

Now on 26-05-2021, the table in RDS looks like this:

id | username
1  | john
2  | cathy
3  | ola
4  | mike

When I run the data loader on 27-05-2021, instead of copying all four rows, I want to copy/take only the rows which have been newly added (id = 3 and id = 4), as I already have the other rows.

What should be the best way of doing this incremental loading?

Upvotes: 2

Views: 2560

Answers (1)

John Rotenstein

Reputation: 270224

The COPY command will always load the entire table. However, you could create an External Table using Redshift Spectrum that accesses the files without loading them into Redshift. Then, you could construct a query that does an INSERT where the ID is greater than the last ID used in the Redshift table.

Perhaps I should explain it a bit simpler...

  • Table existing_table in Redshift already has rows up to id = 2
  • CREATE EXTERNAL TABLE in_data to point at the files in S3 containing the data
  • Then use INSERT INTO existing_table SELECT * FROM in_data WHERE id > (SELECT MAX(id) FROM existing_table)

In theory, this should only load the new rows into the table.
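The filtering step can be demonstrated locally. Below is a minimal sketch using Python's built-in SQLite in place of Redshift, with `existing_table` and `in_data` playing the roles described above (in the real setup, `in_data` would be the Spectrum external table over the S3 files, not a local table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "existing_table" plays the role of the Redshift table already loaded up to id = 2.
cur.execute("CREATE TABLE existing_table (id INTEGER, username TEXT)")
cur.executemany("INSERT INTO existing_table VALUES (?, ?)",
                [(1, "john"), (2, "cathy")])

# "in_data" stands in for the external table pointing at the full S3 snapshot.
cur.execute("CREATE TABLE in_data (id INTEGER, username TEXT)")
cur.executemany("INSERT INTO in_data VALUES (?, ?)",
                [(1, "john"), (2, "cathy"), (3, "ola"), (4, "mike")])

# Incremental load: insert only rows whose id exceeds the current maximum.
cur.execute("""
    INSERT INTO existing_table
    SELECT * FROM in_data
    WHERE id > (SELECT MAX(id) FROM existing_table)
""")
conn.commit()

rows = cur.execute("SELECT id, username FROM existing_table ORDER BY id").fetchall()
print(rows)  # [(1, 'john'), (2, 'cathy'), (3, 'ola'), (4, 'mike')]
```

Only id 3 and id 4 are inserted on the second run, matching the incremental behaviour asked for in the question. Note this assumes `id` is monotonically increasing and rows are never updated in place; updated rows would be silently skipped.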

Upvotes: 1
