user14380579
user14380579

Reputation:

Delete rows where ID exists more than once in Redshift

I'm playing around with Redshift for practice. I'm loading data into a Redshift table on a daily basis, and trying to remove duplicates after each ingestion. I initially tried the following to create a new table with distinct records, then deleting the old one.

CREATE TABLE reddit_new AS SELECT DISTINCT * FROM reddit;
ALTER TABLE reddit RENAME TO reddit_old;
ALTER TABLE reddit_new RENAME TO reddit;
DROP TABLE reddit_old;

However I then realised that although some rows have the same ID, there are certain columns that are always different.

So rather than removing duplicate rows, I need to remove rows where the ID is a duplicate. Ideally, I want to keep the record that has the most recent date. If they had the same date, then just remove either or. So in the following example, it would just be row 2 being removed.

ID      Date
34      2022-02-01
23      2022-03-05
12      2022-03-06
23      2022-03-18

I also thought about updating my COPY command to only add records where ID doesn't exist in table, but not sure if that's possible. This is my current COPY command, which runs daily, copying from a new file in S3:

f"COPY public.Reddit FROM '{s3_file}' iam_role '{role_string}' IGNOREHEADER 1 DELIMITER ',' CSV"

Upvotes: 0

Views: 857

Answers (1)

Jeremy Fortune
Jeremy Fortune

Reputation: 2499

A common pattern to address this is not to copy into your table directly, but rather first to a (possibly temporary) staging location, then use the data in that table to delete from the primary.

CREATE TABLE staging LIKE "Reddit";
COPY staging FROM '<s3_file>' iam_role '<role>' ignoreheader 1 delimiter ',' csv;
DELETE FROM public."Reddit"
USING staging
WHERE 
  public."Reddit"."ID" = staging."ID"
  AND public."Reddit"."Date" <= staging."Date";
ALTER TABLE public."Reddit" append FROM staging;
DROP TABLE if EXISTS staging;

Here I have not used a temporary just so that alter append can work, but you can use insert into from a temporary table instead.

Upvotes: 0

Related Questions