Updating specific Duplicate entries

Question

In my SQL tables, I am attempting to write a stored procedure that will look for duplicates in each table and update a specific column true or false if it finds one. This is a bit more complicated due to foreign key use and I'm not sure on the best way to write it.

Goal: In table A, update column 'isDuplicates' (all start as FALSE) to be TRUE if duplicate criteria is met

Foreign Key: Every table in my database contains data from source materials, each of which has a release date. There is a Sources table that contains a SourceDate. Every other table has a Sources column with a foreign key that maps to its entry on the Sources table, allowing me to access the date.

Duplicate Criteria: In table A, an entry is a duplicate if there is any other entry (different primary key 'Id') with the same name (column 'Name'), and the other entry was released later (foreign key.SourceDate is greater than the current one)

Ultimately, I'm trying to end up with a table where 'isDuplicate' = TRUE means that there is a duplicate item with same name/different Id, AND the other item was released later. So if something is a duplciate but is, itself, the most recent version, that should say 'isDuplicate'=FALSE

Hopefully this makes sense. Thank you all in advance.

Edit: Updating with an example.

TABLE A has columns NAME, SOURCE, and ISDUPLICATE

Frank 71 False
Tom 71 False
Tom 83 False
Richard 83 False
Frank 13 False

SOURCES table has columns ID and SOURCEDATE

13 2/1/2010
71 5/1/2014
83 8/1/2014

After running the stored procedure on table A, I would expect the following results:

Frank 71 False
Tom 71 True
Tom 83 False
Richard 83 False
Frank 13 True

Both Frank (Source 13) and Tom (71) get marked TRUE as duplicates as they are older versions of another existing entry. I leave Tom (83) and Frank (71) as FALSE as they are the primary entry I want to display.

The purpose for not just removing the duplicates is to allow the users functionality - by default, they'll only see one entry for each thing, and it will always be the most recently released thing. However, I'll also allow them to see everything if they want (basically default to only pulling data with isDuplicate=FALSE, but allow them to turn it off and show duplicates/legacy data).

GMB · Accepted Answer

I think that the following query should give you what you want. It works by joining tables tableA and Sources. An inline correlated subquery can be used to check if the record is considered a duplicate :

SELECT t.Name, t.Source, CASE WHEN EXISTS (
    SELECT 1 
    FROM tableA t1 
    INNER JOIN Sources s1 ON s1.ID = t1.Source
    WHERE t1.Name = t.Name AND s1.SourceDate < s.SourceDate
    ) THEN 1 ELSE 0 END AS isDuplicate
FROM tableA
INNER JOIN Sources s ON s.ID = t.Source

Column isDuplicate will have value 0 (not a duplicate) or 1 (duplicate)

With SQLServer 2017, use window function ROW_NUMBER() :

SELECT 
    t.Name, 
    t.Source, 
    CASE 
        WHEN ROW_NUMBER() OVER(PARTITION BY t.Name ORDER BY s.SourceDate DESC) = 1 THEN 0
        ELSE 1
    END AS isDuplicate
FROM tableA t
INNER JOIN Sources s ON s.ID = t.Source

You can put this in a CTE a turn it to an update query :

WITH cte AS (
    SELECT 
        t.*,
        ROW_NUMBER() OVER(PARTITION BY t.Name ORDER BY s.SourceDate DESC) AS rn
    FROM tableA t
    INNER JOIN Sources s ON s.ID = t.Source
) UPDATE cte SET isDuplicate = IIF(rn = 1, 0, 1)

Updating specific Duplicate entries

Answers (1)

Related Questions