I'm playing around with Redshift for practice. I'm loading data into a Redshift table on a daily basis and trying to remove duplicates after each ingestion. I initially tried the following to create a new table with distinct records, then delete the old one:
CREATE TABLE reddit_new AS SELECT DISTINCT * FROM reddit;
ALTER TABLE reddit RENAME TO reddit_old;
ALTER TABLE reddit_new RENAME TO reddit;
DROP TABLE reddit_old;
However, I then realised that although some rows share the same ID, certain columns always differ between them.
So rather than removing duplicate rows, I need to remove rows where the ID is a duplicate. Ideally, I want to keep the record with the most recent date; if two records have the same date, either one can be removed. So in the following example, only row 2 would be removed:
ID Date
34 2022-02-01
23 2022-03-05
12 2022-03-06
23 2022-03-18
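One approach I considered is deduplicating in place with a window function, keeping the newest row per ID and then swapping tables as before. A sketch, untested, with an illustrative column list (the real table has more columns):

CREATE TABLE reddit_new AS
SELECT "ID", "Date"   -- list the table's real columns here, excluding rn
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY "ID" ORDER BY "Date" DESC) AS rn
    FROM reddit
) AS t
WHERE rn = 1;   -- rn = 1 is the most recent row per ID; ties broken arbitrarily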
I also thought about updating my COPY command to only add records whose ID doesn't already exist in the table, but I'm not sure if that's possible. This is my current COPY command, which runs daily, copying from a new file in S3:
f"COPY public.Reddit FROM '{s3_file}' iam_role '{role_string}' IGNOREHEADER 1 DELIMITER ',' CSV"
Upvotes: 0
Views: 857
Reputation: 2499
A common pattern to address this is not to COPY into your table directly, but first into a (possibly temporary) staging table, then use the data in that table to delete the superseded rows from the primary table:
CREATE TABLE staging (LIKE public."Reddit");
COPY staging FROM '<s3_file>' iam_role '<role>' IGNOREHEADER 1 DELIMITER ',' CSV;
DELETE FROM public."Reddit"
USING staging
WHERE
    public."Reddit"."ID" = staging."ID"
    AND public."Reddit"."Date" <= staging."Date";
ALTER TABLE public."Reddit" APPEND FROM staging;
DROP TABLE IF EXISTS staging;
Here I have not used a temporary table, simply so that ALTER TABLE APPEND can work (it cannot append from a temporary table), but you can use INSERT INTO ... SELECT from a temporary table instead.
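That temporary-table variant might look like this — a sketch, assuming the same column layout in both tables:

CREATE TEMP TABLE staging (LIKE public."Reddit");
COPY staging FROM '<s3_file>' iam_role '<role>' IGNOREHEADER 1 DELIMITER ',' CSV;

-- Drop existing rows that the new file supersedes...
DELETE FROM public."Reddit"
USING staging
WHERE public."Reddit"."ID" = staging."ID"
  AND public."Reddit"."Date" <= staging."Date";

-- ...then load the staged rows with a plain INSERT, which,
-- unlike ALTER TABLE APPEND, is allowed from a temporary table.
INSERT INTO public."Reddit"
SELECT * FROM staging;

DROP TABLE staging;

The temporary table also disappears automatically at the end of the session, so the final DROP is just tidiness.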
Upvotes: 0