Reputation: 4405
I have been given this task to try to detect some duplicate records in a table with a large volume of rows. The table comprises 2 joined tables. So to begin with I have:
select
b.event_number_id, b.tenure_number_id, a.work_start_date, a.work_stop_date, a.amount
from
MTA.mta_sow_event a, mta_tenure_event_xref b
where
a.event_number_id = b.event_number_id
Now we have a table to work from. The duplicate records have unique event_number_id, the reamining fields will contain identical data, so something like this:
| event_number_id | tenure_number_id | work_start_date | work_stop_date |amount|
|-----------------|-------------------|-----------------|----------------|------|
| 5532733 | 688203 | 01-SEP-14 | 25-SEP-14 | 5000 |
| 5532734 | 688203 | 01-SEP-14 | 25-SEP-14 | 5000 |
So, this is an example of a duplicate record. There are consecutive event_number_id's and all the remaining columns have identical information. We believe that our system has been creating duplicate events for some time now (this isn't supposed to happen), so I want to query the whole joined table and find anything that has rows that have exactly the same data, but different and consecutive event numbers.
So far, I managed to make a simple query that shows me any rows that have identical information, excluding the event_number_id column:
select
b.tenure_number_id, a.work_start_date, a.work_stop_date, a.amount, count(*)
from
MTA.mta_sow_event a, mta_tenure_event_xref b
where
a.event_number_id = b.event_number_id
group by
b.tenure_number_id, a.work_start_date, a.work_stop_date, a.amount
having
count(*) > 1
which returns:
| tenure_number_id | work_start_date | work_stop_date |amount|Count(*)|
|-------------------|-----------------|----------------|------|--------|
| 688203 | 01-SEP-14 | 25-SEP-14 | 5000 | 2 |
The problem is, sometimes there are rows that have identical data, but could be valid, so the best we can do at this point is find any of these matching rows that have consecutive event_number_id's. This is where I am hung up. Is there a way to pull out only the rows that contain these consecutive numbers?
Upvotes: 2
Views: 4126
Reputation: 5565
General idea: group rows with the same values (partition by tenure_number_id, work_start_date, work_end_date, amount
), find minimal event_number_id
for each group and row number of event_number_id
inside group starting from zero (using analytic functions min
and row_number
), then compare sum of minimal ID and row number with event_number_id
. For consecutive numbers they have to be equal:
with t as (select b.event_number_id, b.tenure_number_id, a.work_start_date, a.work_stop_date, a.amount
from MTA.mta_sow_event a, mta_tenure_event_xref b
where a.event_number_id = b.event_number_id)
select *
from (select t.*,
min(event_number_id) over (partition by tenure_number_id, work_start_date, work_end_date, amount) +
row_number() over (partition by tenure_number_id, work_start_date, work_end_date, amount order by event_number_id) - 1 group_id
from t)
where event_number_id = group_id
Upvotes: 1
Reputation: 52336
Here's an approach based on a join of the data sets:
with cte_base_data as (
select
... your query here ...)
select
from cte_base_data t1 join
cte_base_data t2 on (t1.tenure_number_id = t2.tenure_number_id and
t1.work_start_date = t2.work_start_date and
t1.work_stop_date = t2.work_stop_date and
t1.amount = t2.amount)
where t1.event_number_id = t2.event_number_id - 1;
The efficiency will depend on a few factors, such as the efficiency of scanning the base tables and the size of the data sets.
It would be interesting to see a comparison of the execution plans of this method and the analytics-function methods. This common table expression-based join ought to be very efficient as it depends on hash joins, which have almost no cost as long as they stay in memory (a big question mark over that).
I'd be inclined to go for the analytic functions if the event_number_id's were not consecutive -- if there might be gaps, for instance, which would be harder to implement as a join. Given that one of them is the other incremented, I think it's worth taking a punt on a join.
Upvotes: 1
Reputation: 14848
You can use analytic functions LAG and LEAD for this:
with t as (
select event_number_id eid, b.tenure_number_id tid,
a.work_start_date d1, a.work_stop_date d2, a.amount amt
from mta_sow_event a join mta_tenure_event_xref b using (event_number_id) )
select eid event_number_id, tid tenure_number_id,
d1 work_start_date, d2 work_stop_date, amt amount
from (
select t.*,
lag(eid) over (partition by tid, d1, d2, amt order by eid) l1,
lead(eid) over (partition by tid, d1, d2, amt order by eid) l2
from t )
where eid in (l1+1, l2-1) order by eid, tid
Upvotes: 0