CorribView
CorribView

Reputation: 741

Update all but one of duplicate records in table in SQL Server

I have a SQL Server table that has duplicate entries in one of the columns (object_id) e.g.:

+----+-----------+------------+
| id | object_id | status_val |
+----+-----------+------------+
|  1 |         1 |          0 | 
|  2 |         1 |          0 | 
|  3 |         1 |          0 | 
|  4 |         2 |          0 | 
|  5 |         3 |          0 | 
|  6 |         3 |          0 | 
+----+-----------+------------+

I need to update all of their statuses except for one when there is duplication in the object_id column. So in the table above object_id 1 and 3 are duplicated. So I would want to change their status_val to 2, except one of the entries. The result would look like:

| id | object_id | status_val |
+----+-----------+------------+
|  1 |         1 |          0 | 
|  2 |         1 |          2 | 
|  3 |         1 |          2 | 
|  4 |         2 |          0 | 
|  5 |         3 |          0 | 
|  6 |         3 |          2 | 
+----+-----------+------------+

It doesn't matter which one of the duplicated rows has it's status updated.

Any help would be appreciated.

Upvotes: 2

Views: 16047

Answers (3)

DragonMoon
DragonMoon

Reputation: 411

You may solve this problem without a join, which means it should have better performance. The idea is to group the data by your object_id, counting the row number of each object_id. This is what "partition by" does. Then you can update where the row_num is > 1. This will update all duplicated object_id except the first one!

update t set t.status_val = 'some_status' 
from (
    select *, row_number() over(partition by object_id order by (select null)) row_num  
    from foo
) t 
where row_num > 1 

On a test table of 82944 records, the performance was such (your mileage may vary!): Table 'test'. Scan count 5, logical reads 82283, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. CPU time = 141 ms, elapsed time = 150 ms.

We can certainly also solve this problem by using an inner join, however, in general this should lead to more logical reads and higher CPU:

Table 'test'. Scan count 10, logical reads 83622, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 4, logical reads 167426, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. CPU time = 342 ms, elapsed time = 233 ms.

To loop over the results and update in smaller batches:

declare @rowcount int = 1;
declare @batch_size int = 1000;

while @rowcount > 0 
begin
    update top(@batch_size) t set t.status_val = 'already updated'
    from (
        select *, row_number() over(partition by object_id order by (select null)) row_num  
        from foo
        where status_val <> 'already updated' 
    ) t 
    where row_num > 1 
    set @rowcount = @@rowcount;
end

This will help keep locking down if other concurrent sessions are trying to access this table.

Upvotes: 5

Erran Morad
Erran Morad

Reputation: 4743

According to your question, it seems that for each value of object_id, you want to keep the status_val = 0 for the object_id with lowest id and = 2 for the others. If that is indeed the case, and IF an object_id repeats only a maximum of 3 times, then I have a very simple solution for you. Use the modulo or remainder operator to get what you want. Here is the answer which I will explain later:

update [MyTable]
set status_val = 2
where (id%3) != 1

When you divide any value of id by 3, the remainder can be only 0,1 or 2. So, for each object_id where id%3 is not 1, we change the status_val to 2.

Before executing the above code, see the output of this query -

select id, (id%3) as flg, object_id, status_val 
from MyTable

Upvotes: 1

bf2020
bf2020

Reputation: 732

UPDATE Table
SET Table.status_val = '2'
FROM Table
INNER JOIN
(SELECT id, row_number()OVER(PARTITION BY object_id ORDER BY id) as seq FROM Table) other_table
ON Table.id = other_table.id AND seq <> 1

Upvotes: 1

Related Questions