Dirjit

Reputation: 87

Update or Insert based on key columns in Redshift

I am loading CSV files into Redshift daily. To handle duplicates, I load the files into a staging table and then use update-or-insert scripts based on key columns to load the target table. Recently I unexpectedly found duplicate data in the target table.

I double-checked my scripts and don't see any reason for the duplicates. Below are the update and insert script formats that I am using.

For Inserting:

      Insert into target (key1, key2, col3, col4)
      Select key1, key2, col3, col4
      From stage s
      Where not exists (select 1 from target t
                        where s.key1 = t.key1
                          and s.key2 = t.key2);

And for update:

      Update target Set
          key1=s.key1, key2=s.key2, col3=s.col3, col4=s.col4
      From stage s where target.key1=s.key1 and target.key2=s.key2;

Any help is appreciated.

Upvotes: 0

Views: 1420

Answers (1)

mike_pdb

Reputation: 2828

I ran into this too. The problem was in the insert ... select, where the select itself produced duplicate rows. One solution for us was to use a cursor (outside of Redshift) to run the select and insert one record at a time, but that proved to have performance issues. Instead we now check for duplicates with an initial select:

      select key1, key2 from stage group by key1, key2 having count(*) > 1;

and stop the process if records are returned.
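If stopping the load is not an option, a hedged alternative sketch is to collapse the staging duplicates inside the insert's select with ROW_NUMBER(), assuming any one row per key pair is acceptable to keep (the ordering column below is arbitrary, since the question's schema has no load timestamp):

      Insert into target (key1, key2, col3, col4)
      Select key1, key2, col3, col4
      From (select key1, key2, col3, col4,
                   -- rn = 1 keeps one arbitrary row per key pair;
                   -- order by a load timestamp instead if one exists
                   row_number() over (partition by key1, key2 order by col3) as rn
            from stage) s
      Where s.rn = 1
        and not exists (select 1 from target t
                        where s.key1 = t.key1
                          and s.key2 = t.key2);

Whether to stop the process or collapse the rows depends on whether duplicate keys in the source files are expected; the check above at least makes them visible before they reach target.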

Upvotes: 2
