Bat Masterson

Reputation: 1184

Efficient ETL Upsert in Redshift

I'm having performance issues ETLing updatable tables from our OLTP environment to Redshift. Our basic workflow is the typical OLTP -> S3 -> Redshift data flow. Let's say I want to ETL a table like this:

create table source_data (
  id int primary key,
  status varchar(32),
  value decimal(10,4),
  dateLastTouched datetime,
  dateCreated datetime,
  index datelasttouched_index (dateLastTouched)
);

to a similar table in Redshift. In preparation for the ETL, I made the sort key dateLastTouched and the dist key id. We ETL any records with a dateLastTouched after the maximum dateLastTouched loaded by the previous job.
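
For reference, a minimal sketch of what the Redshift side of this might look like (the table name target_data and the column width are assumptions; only the sort key and dist key choices come from the setup above):

-- Hypothetical Redshift target table; dist key = id, sort key = dateLastTouched,
-- as described above. Redshift has no datetime type, so timestamp is used.
create table target_data (
  id int,
  status varchar(32),
  value decimal(10,4),
  dateLastTouched timestamp,
  dateCreated timestamp,
  primary key (id)  -- informational only; Redshift does not enforce it
)
distkey (id)
sortkey (dateLastTouched);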

This setup works very well for tables where old records are never updated, but once they can be (e.g. a record from last year changes its status), I can't see any way to ETL efficiently. The way we currently do it is as follows (sketched in SQL after the list):

  1. Send all new or updated records (i.e. those with dateLastTouched past the previous job's high-water mark) to S3.
  2. COPY the data from S3 into a staging table.
  3. Delete any records from the destination table that have the same primary key as the new data.
  4. Insert all records from the staging table.
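
Roughly, in SQL, steps 2-4 look like this (staging_data, the S3 path, and the IAM role are all placeholders):

-- Step 2: load the extracted records into a staging table.
copy staging_data
from 's3://my-bucket/etl/source_data/'          -- placeholder path
iam_role 'arn:aws:iam::123456789012:role/etl'   -- placeholder role
format as csv;

-- Steps 3 and 4 in one transaction, so readers never see the
-- table with rows missing.
begin;

-- Step 3: remove old versions of the incoming rows. This is the
-- slow step: with dateLastTouched as the sort key, Redshift can't
-- use the sort key to prune blocks when matching on id.
delete from target_data
using staging_data
where target_data.id = staging_data.id;

-- Step 4: insert the fresh versions.
insert into target_data
select * from staging_data;

end;

truncate staging_data;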

Given our setup with dateLastTouched as the sort key, step 3 is very slow, usually taking 1-2 minutes and clearly taking longer as the table grows. We can't change the sort key to the primary key because we need dateLastTouched for reporting queries that run rather frequently against the table. Some ideas we've considered:

  1. An interleaved sort key on id and dateLastTouched. We tried this on another table and the performance gains weren't significant; the vacuum reindex times were also terrible.
  2. Don't delete at all: just insert, and have periodic jobs materialize the "latest record per id" into another table (sketched below). This isn't ideal because it practically doubles the space taken up by a large table, and updates aren't frequent.
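
For option 2, a minimal sketch of the periodic materialization job, assuming an append-only history table target_data_history and a materialized target_data_latest table (both names are placeholders):

-- Rebuild the "latest record per id" table from the append-only history.
-- Note: TRUNCATE commits immediately in Redshift, so the table is briefly
-- empty; use DELETE inside a transaction if readers must never see that.
truncate target_data_latest;

insert into target_data_latest
select id, status, value, dateLastTouched, dateCreated
from (
  select *,
         row_number() over (partition by id order by dateLastTouched desc) as rn
  from target_data_history
) ranked
where rn = 1;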

Is there a better paradigm for efficient upserts from S3 to Redshift? Or do I just need to eat the ETL/materialized-view cost?

Upvotes: 2

Views: 2234

Answers (1)

AlexYes

Reputation: 4208

Another option is to keep 2 versions of the table: one sorted by id and used for ETL, and another sorted by dateLastTouched and used for reporting. When the ETL process completes on the first one, you just recreate the second (not with order by, but simply truncate t2, insert into t2 select * from t1, and vacuum reindex t2).
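
A minimal sketch of that refresh, with t1 as the id-sorted ETL table and t2 as the dateLastTouched-sorted reporting copy (both names are placeholders):

-- t1: upsert target, sorted by id, so the DELETE ... USING step stays fast.
-- t2: reporting copy, sorted by dateLastTouched.
truncate t2;

insert into t2
select * from t1;

-- Restore t2's sort order after the unsorted bulk insert.
vacuum reindex t2;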

Also, depending on the table size and the configuration of your cluster, it might actually be faster to reload the entire table each run instead of doing the upsert at all.
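
A sketch of that full reload, assuming the complete table is exported to S3 each run (path and role are placeholders):

-- TRUNCATE commits immediately in Redshift, so the table is briefly
-- empty while the COPY runs.
truncate target_data;

copy target_data
from 's3://my-bucket/etl/source_data_full/'     -- placeholder path
iam_role 'arn:aws:iam::123456789012:role/etl'   -- placeholder role
format as csv;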

Upvotes: 4
