I'm playing around with Redshift for practice. I'm loading data into a Redshift table on a daily basis and trying to remove duplicates after each ingestion. I initially tried the following to create a new table with distinct records, then delete the old one:
CREATE TABLE reddit_new AS SELECT DISTINCT * FROM reddit;
ALTER TABLE reddit RENAME TO reddit_old;
ALTER TABLE reddit_new RENAME TO reddit;
DROP TABLE reddit_old;
However, I then realised that although some rows share the same ID, certain columns always differ between them.
So rather than removing duplicate rows, I need to remove rows where the ID is a duplicate. Ideally, I want to keep the record with the most recent date; if two records have the same date, either one can be removed. So in the following example, only row 2 would be removed:
ID Date
34 2022-02-01
23 2022-03-05
12 2022-03-06
23 2022-03-18
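One approach I considered is deduplicating in place with a window function, keeping the newest row per ID and then swapping tables as before. A sketch, untested, with an illustrative column list (the real table has more columns):

CREATE TABLE reddit_new AS
SELECT "ID", "Date"   -- list the table's real columns here, excluding rn
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY "ID" ORDER BY "Date" DESC) AS rn
    FROM reddit
) AS t
WHERE rn = 1;   -- rn = 1 is the most recent row per ID; ties broken arbitrarily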
I also thought about updating my COPY command to only add records whose ID doesn't already exist in the table, but I'm not sure if that's possible. This is my current COPY command, which runs daily, copying from a new file in S3:
f"COPY public.Reddit FROM '{s3_file}' iam_role '{role_string}' IGNOREHEADER 1 DELIMITER ',' CSV"
Upvotes: 0
Views: 857
Reputation: 2499
A common pattern to address this is not to COPY into your table directly, but first into a (possibly temporary) staging table, then use the data in that table to delete the superseded rows from the primary table:
CREATE TABLE staging (LIKE public."Reddit");
COPY staging FROM '<s3_file>' iam_role '<role>' IGNOREHEADER 1 DELIMITER ',' CSV;
DELETE FROM public."Reddit"
USING staging
WHERE
    public."Reddit"."ID" = staging."ID"
    AND public."Reddit"."Date" <= staging."Date";
ALTER TABLE public."Reddit" APPEND FROM staging;
DROP TABLE IF EXISTS staging;
Here I have not used a temporary table, simply so that ALTER TABLE APPEND can work (it cannot append from a temporary table), but you can use INSERT INTO ... SELECT from a temporary table instead.
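That temporary-table variant might look like this — a sketch, assuming the same column layout in both tables:

CREATE TEMP TABLE staging (LIKE public."Reddit");
COPY staging FROM '<s3_file>' iam_role '<role>' IGNOREHEADER 1 DELIMITER ',' CSV;

-- Drop existing rows that the new file supersedes...
DELETE FROM public."Reddit"
USING staging
WHERE public."Reddit"."ID" = staging."ID"
  AND public."Reddit"."Date" <= staging."Date";

-- ...then load the staged rows with a plain INSERT, which,
-- unlike ALTER TABLE APPEND, is allowed from a temporary table.
INSERT INTO public."Reddit"
SELECT * FROM staging;

DROP TABLE staging;

The temporary table also disappears automatically at the end of the session, so the final DROP is just tidiness.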
Upvotes: 0