Vald
Vald

Reputation: 1

Create SQL indexes for speed

I am removing duplicates from a table with a transaction_id column and a last_modified_date column (see below the query). The idea is that I should have one record per transaction_id, so I need to remove duplicates, keeping the last modified record for a given transaction_id.

The query works but is slow.

The question is: what index I should create to speed up the query execution time?

With CTE_Duplicates as
(
   select 
       transaction_id, 
       row_number() over (partition by transaction_id order by last_modified_date desc) rownumber 
   from 
       TRANSACTIONS 
)  
delete from CTE_Duplicates 
where rownumber != 1;  

Thanks!

Vald

Upvotes: 0

Views: 47

Answers (3)

Uueerdo
Uueerdo

Reputation: 15941

No matter what solution you choose, probably the best thing you can do is add a compound index on (transaction_id, last_modified_date). After doing that, I would go with an aggregate function over a windowing one (given their partitioning and ordering abilities, I am not sure how well they would take advantage of the ideal index)...

; WITH keepers AS (
   SELECT transaction_id, MAX(last_modified_date) AS last_modified_date
   FROM transactions 
   GROUP BY transaction_id
)
DELETE t 
FROM transactions AS t
LEFT JOIN keepers AS k
   ON t.transaction_id = k.transaction_id 
   AND t.last_modified_date = k.last_modified_date
WHERE k.transaction_id IS NULL
;

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269563

For your version of the query:

With CTE_Duplicates as (
    select t.*,
           row_number() over (partition by transaction_id order by last_modified_date desc) as rownumber
    from TRANSACTIONS
   )
delete from CTE_Duplicates
    where rownumber > 1;

You want an index on (transaction_id, last_modified_date desc). However, with that same index, it might be faster to phrase the query as:

delete t from transactions t
    where t.last_modified_date = (select max(t2.last_modified_date)
                                  from transactions t2
                                  where t2.transaction_id = t.transaction_id
                                 );

All that said, your query will be quite expensive if many rows are being deleted ("many" might even be a few percent). In that case, a temporary table solution might be better:

select t.*
into temp_transactions
from transactions t
where t.last_modified_date = (select max(t2.last_modified_date)
                              from transactions t2
                              where t2.transaction_id = t.transaction_id
                             );

truncation table temp_transactions;  -- backup first!

insert into transactions
    select *
    from temp_transactions;

Of course, the logic will be more complicated if you have identity columns or triggers that set values on the table.

Upvotes: 1

GMB
GMB

Reputation: 222432

For this query:

with CTE_Duplicates as (
    select 
        transaction_id, 
        row_number() 
            over(partition by transaction_id order by last_modified_date desc ) rownumber 
    from TRANSACTIONS 
) 
delete from CTE_Duplicates where rownumber!=1;

You just want a composite index on (transaction_id, last_modified_date).

create index idx_transactions_dup on transactions(transaction_id, last_modified_date);

Upvotes: 0

Related Questions