Reputation: 1184
I'm having performance issues ETLing updatable tables from our OLTP environment to Redshift. Our basic workflow is the typical OLTP -> S3 -> Redshift data flow. Let's say I want to ETL a table like this:
create table source_data (
    id int primary key,
    status varchar,
    value decimal(10,4),
    dateLastTouched datetime,
    dateCreated datetime,
    index datelasttouched_index (dateLastTouched)
);
to a similar table in Redshift. In preparation for the ETL, I made the sort key dateLastTouched and the dist key id. We ETL any records with a dateLastTouched after the maximum dateLastTouched ETL'd by the previous job.
This setup works very well for tables whose old records are never updated, but once old records can change (e.g. a record from last year changes its status), I can't see any way to ETL efficiently. The way we currently do it is:
Given our setup with dateLastTouched as the sort key, step 3 is very slow, usually taking 1-2 minutes and clearly taking longer as time goes on. We can't change the sort key to the primary key because we need dateLastTouched for the reporting queries that run rather frequently on the table. Some ideas we've considered:
Is there a better paradigm for efficient upserts from S3 to Redshift? Or do I just need to eat the ETL/materialized-view cost?
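For reference, the staging-table merge this kind of job usually boils down to looks roughly like the sketch below (the staging table name, S3 prefix and IAM role are illustrative, not our actual objects):

-- load the changed rows from S3 into a staging table
create temp table staging_source_data (like source_data);

copy staging_source_data
from 's3://my-bucket/source_data/incremental/'
iam_role 'arn:aws:iam::123456789012:role/redshift-copy-role'
format as csv;

-- delete the stale versions of those rows from the target;
-- with id as the dist key but dateLastTouched as the sort key,
-- this delete cannot prune blocks by sort key and effectively scans the table
delete from source_data
using staging_source_data
where source_data.id = staging_source_data.id;

-- insert the fresh versions and clean up
insert into source_data
select * from staging_source_data;

drop table staging_source_data;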
Upvotes: 2
Views: 2234
Reputation: 4208
Another option is to have 2 versions of the table, one sorted by id and used for ETL, and another sorted by dateLastTouched and used for reporting. When the ETL process is completed on the first one, you just recreate the second (not using order by, but just truncate t2, insert into t2 select * from t1, and vacuum reindex t2).
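A minimal sketch of that rebuild, assuming t1 is the id-sorted copy maintained by the ETL and t2 is the dateLastTouched-sorted copy used for reporting:

-- repopulate the reporting copy from the freshly loaded ETL copy
truncate t2;
insert into t2 select * from t1;

-- re-sort the reporting copy (vacuum cannot run inside a transaction block)
vacuum reindex t2;

-- optionally refresh planner statistics as well
analyze t2;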
Also, depending on the table size and the configuration of your cluster, it might actually be faster to reload the entire table rather than bothering with an upsert at all.
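For example, a full reload can be as simple as the following sketch (again, the S3 prefix and IAM role are illustrative):

-- truncate commits immediately in Redshift and cannot be rolled back
truncate source_data;

copy source_data
from 's3://my-bucket/source_data/full_export/'
iam_role 'arn:aws:iam::123456789012:role/redshift-copy-role'
format as csv;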
Upvotes: 4