Reputation: 16887
I have a database table as follows:
ProductDetails
-----------------
ProductDetailsID int
ProductIdentifier VARCHAR (20)
ProductID int
ProductFile VARCHAR(255)
ProductAvailability char(2)
RightsCountry varchar(MAX)
Deleted bit
There was a bug in the platform recently that allowed a large number of duplicates in. So I could have multiple ProductDetails entries which are the same EXCEPT for the ProductDetailsID (PK) and ProductFile (this is null, for some reason the duplicates didn't insert the files).
I need to write a T-SQL script that finds these duplicates with a view to deleting them (after examination).
I have found this online, which is great. It gives me the ProductIdentifier with several records, and the number of duplicates.
SELECT pd.ProductIdentifier, COUNT(pd.ProductIdentifier) AS NumOccurrences
FROM dbo.ProductDetails pd
GROUP BY pd.ProductIdentifier
HAVING ( COUNT(pd.ProductIdentifier) > 1 )
The thing is, some of these records should remain. I need to select the ProductDetail records that have duplicate ProductIdentifiers, where at least 1 of the duplicates has a FileName and all other columns are exactly the same. For example, if I have a dataset as follows:
ProductDetailsID | ProductIdentifier | ProductID | ProductFile | ProductAvailability | RightsCountry | Deleted
123 | 567890 | 12 | filename.png | 20 | AU CX CC CK HM NZ NU NF TK | 0
124 | 567890 | 12 | (NULL) | 20 | AU CX CC CK HM NZ NU NF TK | 0
125 | 567890 | 12 | (NULL) | 20 | AU CX CC CK HM NZ NU NF TK | 0
I need to return ProductDetailsID 124 and 125 as these are for deletion. I'd appreciate any guidance or links to examples or any help at all!
Upvotes: 1
Views: 939
Reputation: 239664
This works and is a cheeky way of abusing the partitioned window functions:
declare @ProductDetails table(
ProductDetailsID int not null,
ProductIdentifier VARCHAR(20) not null,
ProductID int not null,
ProductFile VARCHAR(255) null,
ProductAvailability char(2) not null,
RightsCountry varchar(MAX) not null,
Deleted bit not null
)
insert into @ProductDetails(ProductDetailsID,ProductIdentifier,ProductID,
ProductFile,ProductAvailability,RightsCountry,Deleted) values
(123,567890,12,'filename.png',20,'AU CX CC CK HM NZ NU NF TK',0),
(124,567890,12,NULL,20,'AU CX CC CK HM NZ NU NF TK',0),
(125,567890,12,NULL,20,'AU CX CC CK HM NZ NU NF TK',0)
;With FillInFileNames as (
select *,
MAX(ProductFile) OVER (PARTITION BY ProductIdentifier,ProductID,
ProductAvailability,RightsCountry,Deleted)
as AnyFileName
from @ProductDetails
)
select * from FillInFileNames
where ProductFile is null and AnyFileName is not null
And the fact that aggregate functions will never return NULL
if at least one input value wasn't NULL
.
Result:
ProductDetailsID ProductIdentifier ProductID ProductFile ProductAvailability RightsCountry Deleted AnyFileName
---------------- -------------------- ----------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
124 567890 12 NULL 20 AU CX CC CK HM NZ NU NF TK 0 filename.png
125 567890 12 NULL 20 AU CX CC CK HM NZ NU NF TK 0 filename.png
It may also be instructive for the OP to observe that the top of my script isn't much more than the table information and sample data provided in their question - except mine is actually runnable.
It might be worth considering writing your samples in such a style in the future, because that way it can be immediately copy & pasted from your question into a query window.
Upvotes: 1
Reputation: 10098
create view rows_to_delete
as
select *
from (
select *,
row_number() over(partition by ProductIdentifier order by ProductFile desc, ProductDetailsID) as rn
from t
) x
where rn > 1
and ProductFile is null
Upvotes: 1
Reputation: 5669
The following query first gets the counts and then filters those records whose fields match with the entry with the fileName.
WITH counts as
(
SELECT
ProductDetailsID
, ProductIdentifier
, ProductID
, ProductFile
, ProductAvailability
, RightsCountry
, Deleted
, COUNT(*) as cnt_files
FROM
dbo.ProductDetails pd
GROUP BY
ProductDetailsID
, ProductIdentifier
, ProductID
, ProductFile
, ProductAvailability
, RightsCountry
, Deleted
)
SELECT
c1.*
FROM
counts c1
INNER JOIN counts c2
ON c1.ProductIdentifier = c2.ProductIdentifier
AND c1.ProductID = c2.ProductID
AND c1.ProductFile = c2.ProductFile
AND c1.ProductAvailability = c2.ProductAvailability
AND c1.RightsCountry = c2.RightsCountry
AND c1.Deleted = c2.Deleted
WHERE ProductFile is NULL;
Upvotes: 0