Reputation: 87
I am loading CSV files into Redshift daily. To handle duplicates, I load the files into a staging table and then use update and insert scripts based on key columns to load the data into the target table. Recently I unexpectedly found duplicate data in the target table.
I double-checked my scripts and don't see any reason for the duplicates. Below are the update and insert script formats I am using.
For Inserting:
Insert into target (key1, key2, col3, col4)
Select key1, key2, col3, col4
From stage s
Where not exists (select 1 from target t
                  where s.key1 = t.key1
                    and s.key2 = t.key2);
And for update:
Update target
Set key1 = s.key1, key2 = s.key2, col3 = s.col3, col4 = s.col4
From stage s
Where target.key1 = s.key1 and target.key2 = s.key2;
Any help is appreciated.
Upvotes: 0
Views: 1420
Reputation: 2828
I ran into this too. The problem was in the insert ... select: the select itself produced duplicates. Rows that are duplicated within the staging table all pass the not exists check, because none of them exist in the target yet, so every copy gets inserted. One solution for us was to use a cursor (outside of Redshift) to run the select and insert one record at a time, but that had performance issues. Instead we now check for duplicates up front with an initial select
select key1,key2 from stage group by key1,key2 having count(*) > 1;
and stop the process if records are returned.
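If you would rather let the load continue than stop it, a variation is to deduplicate the staging rows inside the insert itself. This is only a sketch built on the table and column names from the question; it keeps one arbitrary row per (key1, key2) via row_number(), so adjust the order by if a specific duplicate should win.

-- Sketch: keep a single row per (key1, key2) from stage before the
-- not-exists check, so duplicates inside the same batch can't reach target.
insert into target (key1, key2, col3, col4)
select key1, key2, col3, col4
from (
    select key1, key2, col3, col4,
           row_number() over (partition by key1, key2 order by col3) as rn  -- arbitrary tie-break
    from stage
) s
where s.rn = 1
  and not exists (select 1 from target t
                  where t.key1 = s.key1
                    and t.key2 = s.key2);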
Upvotes: 2