Reputation: 741
I have a SQL Server table that has duplicate entries in one of the columns (object_id
) e.g.:
+----+-----------+------------+
| id | object_id | status_val |
+----+-----------+------------+
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 2 | 0 |
| 5 | 3 | 0 |
| 6 | 3 | 0 |
+----+-----------+------------+
I need to update all of their statuses except for one when there is duplication in the object_id
column. So in the table above object_id
1 and 3 are duplicated. So I would want to change their status_val
to 2, except one of the entries. The result would look like:
| id | object_id | status_val |
+----+-----------+------------+
| 1 | 1 | 0 |
| 2 | 1 | 2 |
| 3 | 1 | 2 |
| 4 | 2 | 0 |
| 5 | 3 | 0 |
| 6 | 3 | 2 |
+----+-----------+------------+
It doesn't matter which one of the duplicated rows has it's status updated.
Any help would be appreciated.
Upvotes: 2
Views: 16047
Reputation: 411
You may solve this problem without a join, which means it should have better performance. The idea is to group the data by your object_id, counting the row number of each object_id. This is what "partition by" does. Then you can update where the row_num is > 1. This will update all duplicated object_id except the first one!
update t set t.status_val = 'some_status'
from (
select *, row_number() over(partition by object_id order by (select null)) row_num
from foo
) t
where row_num > 1
On a test table of 82944 records, the performance was such (your mileage may vary!): Table 'test'. Scan count 5, logical reads 82283, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. CPU time = 141 ms, elapsed time = 150 ms.
We can certainly also solve this problem by using an inner join, however, in general this should lead to more logical reads and higher CPU:
Table 'test'. Scan count 10, logical reads 83622, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 4, logical reads 167426, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. CPU time = 342 ms, elapsed time = 233 ms.
To loop over the results and update in smaller batches:
declare @rowcount int = 1;
declare @batch_size int = 1000;
while @rowcount > 0
begin
update top(@batch_size) t set t.status_val = 'already updated'
from (
select *, row_number() over(partition by object_id order by (select null)) row_num
from foo
where status_val <> 'already updated'
) t
where row_num > 1
set @rowcount = @@rowcount;
end
This will help keep locking down if other concurrent sessions are trying to access this table.
Upvotes: 5
Reputation: 4743
According to your question, it seems that for each value of object_id, you want to keep the status_val = 0 for the object_id with lowest id and = 2 for the others. If that is indeed the case, and IF an object_id repeats only a maximum of 3 times, then I have a very simple solution for you. Use the modulo or remainder operator to get what you want. Here is the answer which I will explain later:
update [MyTable]
set status_val = 2
where (id%3) != 1
When you divide any value of id
by 3, the remainder can be only 0,1 or 2. So, for each object_id where id%3 is not 1, we change the status_val to 2.
Before executing the above code, see the output of this query -
select id, (id%3) as flg, object_id, status_val
from MyTable
Upvotes: 1
Reputation: 732
UPDATE Table
SET Table.status_val = '2'
FROM Table
INNER JOIN
(SELECT id, row_number()OVER(PARTITION BY object_id ORDER BY id) as seq FROM Table) other_table
ON Table.id = other_table.id AND seq <> 1
Upvotes: 1