Hooodie
Hooodie

Reputation: 51

Deleting duplicate rows by certain conditions with SQL

I am trying to delete duplicate names under certain criteria:

  1. their task is null (completely ignore rows with not null tasks)
  2. delete all the duplicates but the newest row

Not the best with SQL but have cobbled this together.

There's only a combined primary key combining first_name and another random column. So ctid is used as a reference to the row from my understanding. I group reoccurring names where task is null and then select only the latest of each group with MAX(date_started). I delete all duplicate rows (a.name = b.name) meeting the criteria that task is null (a.task IS NULL) except the one I had selected by using a.ctid <> b.ctid (b.ctid being the selected newest row by date_started). At least this is what I was hoping to do. It's not actually working.

DELETE FROM player_assignments a 
USING (SELECT MAX(date_started) as ctid, date_started, name, task 
FROM player_assignments WHERE task IS NULL GROUP BY name HAVING COUNT(*) > 1) b 
WHERE a.name = b.name AND a.ctid <> b.ctid AND a.task IS NULL

With different minor tweaks, I keep getting two errors in particular.

psycopg2.errors.GroupingError: column "player_assignments.date_started" must appear in the GROUP BY clause or be used in an aggregate function LINE 1: ...ING (SELECT MAX(date_started) as ctid, da...

psycopg2.errors.UndefinedFunction: operator does not exist: tid <> timestamp with time zone LINE 1: ...me = name AND a.ctid <> b.ctid ...

Upvotes: 1

Views: 101

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1270421

I think I would approach this using a correlated subquery:

delete from player_assignments pa
    where pa.task is null and
          pa.date_started < (select max(pa2.date_started)
                             from player_assignments pa2
                             where pa2.name = pa.name and
                                   pa2.task is null
                            );

Note: This assumes that you want to keep one row with null even if other rows with non-NULL exist for the name. If you want to only keep NULL rows when it is the most recent among all rows, then remove the pa2.task is null from the subquery.

Upvotes: 2

Sergey Geron
Sergey Geron

Reputation: 10212

Try this one:

DELETE FROM player_assignments a 
USING (
  SELECT 
    name,
    MAX(date_started) as ctid 
  FROM player_assignments 
  WHERE task IS NULL
  GROUP BY name
  HAVING COUNT(*) > 1
) b 
WHERE a.name = b.name
  AND a.date_started <> b.ctid
  AND a.task IS NULL

Upvotes: 1

Related Questions