Reputation: 389

Eliminate duplicates based on Order By SQL

I have a table with more than 20 columns and more the 300.000 rows out of which the relevant columns for my question are the following:

column1 |column2 |column3 | date
   123  |  657   | 2222   | 20 dec
   123  |  658   | 2222   | Null
   124  |  543   | 3333   | Null
   124  |  544   | 3333   | 1 Feb
   125  |  098   | 4444   | Null
   125  |  065   | 4444   | Null

I did the following query to sort them as I needed:

select * 
from table
order by column1,column2,date desc;

The result is:

column1 |column2 |column3 | date
   123  |  657   | 2222   | 20 dec
   123  |  658   | 2222   | Null
   124  |  544   | 3333   | 1 Feb
   124  |  543   | 3333   | Null
   125  |  065   | 4444   | Null
   125  |  098   | 4444   | Null

The main purpose is to delete the rows where based on column1 are duplicates if they have a date that is not null. If they both have Null then I sould be able to keep just one. So what I am looking for is this:

column1 |column2 |column3 | date
   123  |  657   | 2222   | 20 dec
   124  |  544   | 3333   | 1 Feb
   125  |  065   | 4444   | Null

Can someone help me to develop this query. I wish to retreive ALL of the other columns not just the ones displayed.

I am using Azure SQL which is basically SQL Server

I tried doing this:

SELECT * FROM table AS main
WHERE main.column4 IN
(SELECT sub.MinID FROM
   (SELECT column1, column3, min(column4) As MinID
    FROM table
    GROUP BY column1, column3) AS sub)
ORDER BY main.column1, main.column3;

The problem with this is that since the rule is min(column4) when I have a Not Null whose column4 value is less than the other it choose the NULL. I tried putting min(date) but I get this error: Arithmetic overflow error converting expression to data type smalldatetime I am guessing because of the Null values.

How can input the rule that if one is Not null and the other is Null then choose the Not Null

Upvotes: 2

Answers (3)

Vladimir Baranov

Reputation: 32685

Looks like classic top-n-per-group. One way to do it is to use ROW_NUMBER:

WITH
CTE
AS
(
    SELECT
        column1
        ,column2
        ,column3
        ,dt
        ,ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY dt desc, column2) AS rn
    FROM tbl
)
SELECT
    column1
    ,column2
    ,column3
    ,dt
FROM CTE
WHERE rn = 1
ORDER BY
    column1
;

PARTITION BY column1 means that numbering of rows restarts when column1 changes. ORDER BY dt desc, column2 determines the order when numbering the rows. In the end you just pick the first row of the partition using WHERE rn = 1.

The final result will have only one row per column1, so adding more columns to the final ORDER BY doesn't change anything.

Upvotes: 4

Owain Esau

Reputation: 1922

CTE is probably your best bet:

WITH CTE
AS (
    SELECT column1
        ,column2
        ,column3
        ,DATE
        ,ROW_NUMBER() OVER (
            PARTITION BY column1 ORDER BY 
                date
                ,column1
                ,column2
                ,column3 DESC
            ) AS RN
    FROM table
    )
DELETE
FROM CTE
WHERE RN <> 1

To check that what you are deleting is the correct data, just change DELETE to SELECT * at the end.

Upvotes: 0

Charles Bretana

Reputation: 146419

Try this:

Select * from table t
Where (date is not null
    and not exists 
     (select * from table 
      where date is Null
         and column1 = t.Column1)) Or
   column2 = 
     (Select Max(column2) from table
      Where column1 = t.Column1)

Upvotes: 0

Eliminate duplicates based on Order By SQL

Answers (3)

Related Questions