SQL Server - Delete Duplicate Rows - how does Partition By affect this query?

Question

I inherited the following query, which is meant to delete duplicate rows. When I first run it as a SELECT I get some unexpected results, and I believe that has something to do with my lack of understanding of the PARTITION BY part of the statement:

WITH CTE AS (
    SELECT [Id],
           [Url],
           [Identifier],
           [Name],
           [Entity],
           [DOB],
           RN = ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name)
    FROM Data.Statistics
    WHERE Id = 2170
)
DELETE FROM CTE WHERE RN > 1

Can someone help me understand exactly what the PARTITION BY Name part of this is doing? It doesn't limit the query in any way to only looking for duplicates in the Name field, correct? I need to ensure that a record is considered a duplicate only when all 5 of the fields inside the CTE definition are the same.

Tim · Accepted Answer

ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name) doesn't make a lot of sense. You wouldn't ORDER BY the same thing you used in PARTITION BY since it will be the same value for everything in the partition, making the ORDER BY part useless.

Basically, the CTE part of this query says: take the matching rows (those with [Id] = 2170), split them temporarily into one group per distinct name, order the rows within each group by name (which is obviously the same value for all of them), and return each row's position within its group as RN. A name that occurs only once gets a row number of 1, because there is only one row with that name. A name that occurs several times gets row numbers 1, 2, 3, and so on. The order of those rows is undefined in this case because of the silly ORDER BY clause, but if you changed the ORDER BY to something meaningful, the row numbers would follow that sequence.

So, to answer the question directly: PARTITION BY Name does limit the duplicate check to the Name column. Every row after the first with a given name gets RN > 1 and would be deleted, even if its other fields differ. To treat rows as duplicates only when all of the fields match, list all of those columns in the PARTITION BY.
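As a rough illustration of the difference, here is a minimal sketch that can be run on its own in SQL Server. It uses a table variable named @Sample with made-up rows rather than the real Data.Statistics table, and the column types are assumptions. It also assumes that "duplicate" means every column listed in the CTE matches, and it uses ORDER BY (SELECT NULL) as a common way of saying that the order within a partition doesn't matter.

```sql
-- Hypothetical sample data; column types are guesses, not the real schema.
DECLARE @Sample TABLE (
    Id         int,
    Url        varchar(100),
    Identifier varchar(50),
    Name       varchar(50),
    Entity     varchar(50),
    DOB        date
);

INSERT INTO @Sample (Id, Url, Identifier, Name, Entity, DOB) VALUES
    (2170, 'a.example', 'X1', 'Alice', 'Org', '1990-01-01'),
    (2170, 'a.example', 'X1', 'Alice', 'Org', '1990-01-01'),  -- exact duplicate of the row above
    (2170, 'b.example', 'X2', 'Alice', 'Org', '1985-05-05'),  -- same Name, other columns differ
    (2170, 'c.example', 'X3', 'Bob',   'Org', '1970-12-31');

-- Original partitioning: all three 'Alice' rows land in one partition,
-- so two of them get RN > 1 even though only one is a true duplicate.
SELECT *,
       RN = ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name)
FROM @Sample
WHERE Id = 2170;

-- Partitioning by every column: only rows identical in all columns
-- share a partition, so only the exact duplicate gets RN = 2.
WITH CTE AS (
    SELECT *,
           RN = ROW_NUMBER() OVER (
                    PARTITION BY Id, Url, Identifier, Name, Entity, DOB
                    ORDER BY (SELECT NULL))  -- order within a partition is irrelevant here
    FROM @Sample
    WHERE Id = 2170
)
DELETE FROM CTE WHERE RN > 1;

-- Three rows remain: one copy of the duplicated 'Alice' row,
-- the other 'Alice' row, and 'Bob'.
SELECT * FROM @Sample;
```

Before deleting from a real table, it is probably worth swapping the DELETE for SELECT * FROM CTE WHERE RN > 1 and checking that it returns only the rows you expect to lose.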
