Reputation: 1
I am removing duplicates from a table with a transaction_id
column and a last_modified_date
column (see below the query). The idea is that I should have one record per transaction_id
, so I need to remove duplicates, keeping the last modified record for a given transaction_id
.
The query works but is slow.
The question is: what index I should create to speed up the query execution time?
With CTE_Duplicates as
(
select
transaction_id,
row_number() over (partition by transaction_id order by last_modified_date desc) rownumber
from
TRANSACTIONS
)
delete from CTE_Duplicates
where rownumber != 1;
Thanks!
Vald
Upvotes: 0
Views: 47
Reputation: 15941
No matter what solution you choose, probably the best thing you can do is add a compound index on (transaction_id, last_modified_date). After doing that, I would go with an aggregate function over a windowing one (given their partitioning and ordering abilities, I am not sure how well they would take advantage of the ideal index)...
; WITH keepers AS (
SELECT transaction_id, MAX(last_modified_date) AS last_modified_date
FROM transactions
GROUP BY transaction_id
)
DELETE t
FROM transactions AS t
LEFT JOIN keepers AS k
ON t.transaction_id = k.transaction_id
AND t.last_modified_date = k.last_modified_date
WHERE k.transaction_id IS NULL
;
Upvotes: 0
Reputation: 1269563
For your version of the query:
With CTE_Duplicates as (
select t.*,
row_number() over (partition by transaction_id order by last_modified_date desc) as rownumber
from TRANSACTIONS
)
delete from CTE_Duplicates
where rownumber > 1;
You want an index on (transaction_id, last_modified_date desc)
. However, with that same index, it might be faster to phrase the query as:
delete t from transactions t
where t.last_modified_date = (select max(t2.last_modified_date)
from transactions t2
where t2.transaction_id = t.transaction_id
);
All that said, your query will be quite expensive if many rows are being deleted ("many" might even be a few percent). In that case, a temporary table solution might be better:
select t.*
into temp_transactions
from transactions t
where t.last_modified_date = (select max(t2.last_modified_date)
from transactions t2
where t2.transaction_id = t.transaction_id
);
truncation table temp_transactions; -- backup first!
insert into transactions
select *
from temp_transactions;
Of course, the logic will be more complicated if you have identity columns or triggers that set values on the table.
Upvotes: 1
Reputation: 222432
For this query:
with CTE_Duplicates as (
select
transaction_id,
row_number()
over(partition by transaction_id order by last_modified_date desc ) rownumber
from TRANSACTIONS
)
delete from CTE_Duplicates where rownumber!=1;
You just want a composite index on (transaction_id, last_modified_date)
.
create index idx_transactions_dup on transactions(transaction_id, last_modified_date);
Upvotes: 0