DW Brock
DW Brock

Reputation: 1

SQL. Delete duplicate records based on multiple criteria

I have a TABLE called Food with a few thousand records. Each record is a food consisting of a Description and 5 categories. Each succeeding category is more specific than the last. Description, cat1, and cat2 cannot accept NULL values. Cat3, cat4, and cat5 can accept NULL values.

For example:

description | cat1 | cat2    | cat3  | cat4      | cat5
Green apple | Food | produce | NULL  | apple     | NULL
Root beer   | Bev  | Non-Alc | NULL  | Root beer | NULL

The issue I'm having is that there are lots of duplicate descriptions assigned different categories. And I need help writing a code that can delete these duplicates by certain conditions.

Example:

description    | cat1 | cat2     | cat3     | cat4  | cat5
Mango Syrup    | Food | Produce  | NULL     | Mango | NULL
Mango Syrup    | Food | Dry Good | NULL     | NULL  | Syrup
Pepperoni Pizza| Food | Meat     | Pepperoni| NULL  | NULL
Pepperoni Pizza| Food | Bakery   | NULL     | Pizza | NULL

I have hundreds of these, and luckily, they're all set up the same way. Out of the three columns cat3, 4, and 5, only one can be populated. If a record's cat5 IS NOT NULL, cat3 and cat4 will be NULL. If a record's cat4 IS NOT NULL, cat3 and cat5 will be NULL. If a record's cat3 IS NOT NULL, cat4 and cat5 will be NULL.

The correct duplicate is the item whose record has the most specific category populated. All other duplicates should be deleted.

Things I've tried: GROUP BY and using MIN() or MAX() to aggregate different categories. This doesn't give me the result I want.

Anything will help. Appreciate it.

Upvotes: 0

Views: 910

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269633

So for each description, you want the row that has the "biggest" category filled.

One method uses row_number():

select t.*
from (select t.*,
             row_number() over (partition by description
                                order by case when cat5 is null then 1
                                              when cat4 is null then 2
                                              when cat3 is null then 3
                                              when cat2 is null then 4
                                              when cat1 is null then 5
                                         end
                                ) as seqnum
      from t
     ) t
where seqnum = 1;

If you actually want to delete the other rows, then use an updatable CTE:

with todelete as (
      select t.*
      from (select t.*,
                   row_number() over (partition by description
                                      order by case when cat5 is null then 1
                                                    when cat4 is null then 2
                                                    when cat3 is null then 3
                                                    when cat2 is null then 4
                                                    when cat1 is null then 5
                                               end
                                      ) as seqnum
            from t
           ) t
      )
delete from todelete
    where seqnum > 1;

Upvotes: 1

Related Questions