Reputation: 16887

T-SQL Select duplicate rows where at least 1 column has a value

I have a database table as follows:

ProductDetails
-----------------
ProductDetailsID int
ProductIdentifier VARCHAR (20)
ProductID int
ProductFile VARCHAR(255)
ProductAvailability char(2)
RightsCountry varchar(MAX)
Deleted bit

There was a bug in the platform recently that allowed a large number of duplicates in. So I could have multiple ProductDetails entries which are the same EXCEPT for the ProductDetailsID (PK) and ProductFile (this is null, for some reason the duplicates didn't insert the files).

I need to write a T-SQL script that finds these duplicates with a view to deleting them (after examination).

I have found this online, which is great. It gives me the ProductIdentifier with several records, and the number of duplicates.

SELECT pd.ProductIdentifier, COUNT(pd.ProductIdentifier) AS NumOccurrences
FROM dbo.ProductDetails pd
GROUP BY pd.ProductIdentifier
HAVING ( COUNT(pd.ProductIdentifier) > 1 )

The thing is, some of these records should remain. I need to select the ProductDetail records that have duplicate ProductIdentifiers, where at least 1 of the duplicates has a FileName and all other columns are exactly the same. For example, if I have a dataset as follows:

ProductDetailsID | ProductIdentifier | ProductID | ProductFile | ProductAvailability | RightsCountry | Deleted
123 | 567890 | 12 | filename.png | 20 | AU CX CC CK HM NZ NU NF TK | 0
124 | 567890 | 12 | (NULL) | 20 | AU CX CC CK HM NZ NU NF TK | 0
125 | 567890 | 12 | (NULL) | 20 | AU CX CC CK HM NZ NU NF TK | 0

I need to return ProductDetailsID 124 and 125 as these are for deletion. I'd appreciate any guidance or links to examples or any help at all!

Upvotes: 1

Answers (3)

Damien_The_Unbeliever

Reputation: 239664

This works and is a cheeky way of abusing the partitioned window functions:

declare @ProductDetails table(
ProductDetailsID int not null,
ProductIdentifier VARCHAR(20) not null,
ProductID int not null,
ProductFile VARCHAR(255) null,
ProductAvailability char(2) not null,
RightsCountry varchar(MAX) not null,
Deleted bit not null
)

insert into @ProductDetails(ProductDetailsID,ProductIdentifier,ProductID,
             ProductFile,ProductAvailability,RightsCountry,Deleted) values
(123,567890,12,'filename.png',20,'AU CX CC CK HM NZ NU NF TK',0),
(124,567890,12,NULL,20,'AU CX CC CK HM NZ NU NF TK',0),
(125,567890,12,NULL,20,'AU CX CC CK HM NZ NU NF TK',0)

;With FillInFileNames as (
    select *,
    MAX(ProductFile) OVER (PARTITION BY ProductIdentifier,ProductID,
                                    ProductAvailability,RightsCountry,Deleted)
       as AnyFileName
    from @ProductDetails
)
select * from FillInFileNames
where ProductFile is null and AnyFileName is not null

And the fact that aggregate functions will never return NULL if at least one input value wasn't NULL.

Result:

ProductDetailsID ProductIdentifier    ProductID   ProductFile                                                                                                                                                                                                                                                     ProductAvailability RightsCountry                                                                                                                                                                                                                                                    Deleted AnyFileName
---------------- -------------------- ----------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
124              567890               12          NULL                                                                                                                                                                                                                                                            20                  AU CX CC CK HM NZ NU NF TK                                                                                                                                                                                                                                       0       filename.png
125              567890               12          NULL                                                                                                                                                                                                                                                            20                  AU CX CC CK HM NZ NU NF TK                                                                                                                                                                                                                                       0       filename.png

It may also be instructive for the OP to observe that the top of my script isn't much more than the table information and sample data provided in their question - except mine is actually runnable.

It might be worth considering writing your samples in such a style in the future, because that way it can be immediately copy & pasted from your question into a query window.

Upvotes: 1

dean

Reputation: 10098

create view rows_to_delete
as
select *
from (
    select *, 
    row_number() over(partition by ProductIdentifier order by ProductFile desc, ProductDetailsID) as rn
    from t
) x
where rn > 1
and ProductFile is null

Upvotes: 1

Joseph B

Reputation: 5669

The following query first gets the counts and then filters those records whose fields match with the entry with the fileName.

WITH counts as
(
SELECT 
    ProductDetailsID
    , ProductIdentifier
    , ProductID 
    , ProductFile 
    , ProductAvailability 
    , RightsCountry 
    , Deleted
    , COUNT(*) as cnt_files
FROM
    dbo.ProductDetails pd
GROUP BY 
    ProductDetailsID
    , ProductIdentifier
    , ProductID 
    , ProductFile 
    , ProductAvailability 
    , RightsCountry 
    , Deleted
)
SELECT
    c1.*
FROM
    counts c1
INNER JOIN counts c2
ON c1.ProductIdentifier = c2.ProductIdentifier
AND c1.ProductID = c2.ProductID
AND c1.ProductFile = c2.ProductFile
AND c1.ProductAvailability = c2.ProductAvailability
AND c1.RightsCountry = c2.RightsCountry
AND c1.Deleted = c2.Deleted
WHERE ProductFile is NULL;

Upvotes: 0

T-SQL Select duplicate rows where at least 1 column has a value

Answers (3)

Related Questions