Reputation: 687
I would like to know what is the most optimized SQL for this problem. I have to apply a SQL query on a table which contains invalid data in order to correct all those invalid data. The table is structured like below :
TABLE(customer_id, start_date, end_date, type)
Currently, the table can contain many lines for a given tuple (customer_id,type). My query needs to "merge" all line belonging to a group into a single one and keep the most recent date for start date and the oldest for end date :
Cust1;01/01/2012;01/01/2020;1
Cust1;01/01/2010;01/01/2024;1
should transform into a single line
Cust1;01/01/2012;01/01/2024;1
I need to correct the data not only select it : delete lines if there is more than 1 and retrieve datas from each ! I hope my explanations are clear enough ! I work with Oracle DBMS
Thank you,
Upvotes: 1
Views: 74
Reputation: 21170
If you have only small number of duplicated row an alternative approch using in place update / delete will be prefereble.
So first check the number of duplicated rows
with clean as (
select CUSTOMER_ID, TYPE, max(start_date) start_date_clean, max(end_date) end_date_clean
from tab group by CUSTOMER_ID, TYPE)
select tab.*, start_date_clean, end_date_clean
from tab join clean on tab.CUSTOMER_ID = clean.CUSTOMER_ID and tab.TYPE = clean.TYPE
where start_date != start_date_clean or end_date != end_date_clean
;
This query will return all rows that will be processed, i.e. either start or end date are not correct.
If this number is large - go the way as proposed by other answer - copy the table and replace the original table with the copy.
If the **number is small*, go the update
/ delete
way:
update tab a
set a.START_DATE = (select max(b.START_DATE) from tab b where a.customer_id = b.customer_id and a.type = b.type),
a.END_DATE = (select max(b.END_DATE) from tab b where a.customer_id = b.customer_id and a.type = b.type)
where (a.customer_id, a.type) in
(
select tab.CUSTOMER_ID, tab.TYPE
from tab join
(select CUSTOMER_ID, TYPE, max(start_date) start_date_clean, max(end_date) end_date_clean
from tab group by CUSTOMER_ID, TYPE) clean
on tab.CUSTOMER_ID = clean.CUSTOMER_ID and tab.TYPE = clean.TYPE
where start_date != start_date_clean or end_date != end_date_clean);
This updates in all affected rows the start
and end
date to the correct vales.
example
CUSTOMER_ID START_DATE END_DATE TYPE
----------- ------------------- ------------------- ----------
1 01-01-2013 00:00:00 01-01-2016 00:00:00 1
1 01-01-2012 00:00:00 01-01-2018 00:00:00 1
1 01-01-2010 00:00:00 01-01-2017 00:00:00 1
2 01-01-2010 00:00:00 01-01-2018 00:00:00 1
3 01-01-2010 00:00:00 01-01-2018 00:00:00 1
is updated to
CUSTOMER_ID START_DATE END_DATE TYPE
----------- ------------------- ------------------- ----------
1 01-01-2013 00:00:00 01-01-2018 00:00:00 1
1 01-01-2013 00:00:00 01-01-2018 00:00:00 1
1 01-01-2013 00:00:00 01-01-2018 00:00:00 1
2 01-01-2010 00:00:00 01-01-2018 00:00:00 1
3 01-01-2010 00:00:00 01-01-2018 00:00:00 1
In the next step the duplicated rows must be deleted. This makes the next delete which user the ROW_NUMBER
to identify the duplicates:
delete from tab where rowid in
(select RID from (
select rowid rid,
row_number() over (partition by CUSTOMER_ID, TYPE order by null) rn
from tab)
where rn > 1)
;
What you see - the brute force copy approach is simple in queries, but leaves the table offline for some time. You need twice as space to perform it and it will take some time.
The update approach is more complicated, but goes without maintainance window and is quickly done.
Upvotes: 1
Reputation: 32021
use max()
function
select customer_id,type, max(start_date),max(end_date)
from t1
group by customer_id,type
i think you want create another table by using these data
create table test_t as
select customer_id,type, max(start_date),max(end_date)
from t1
group by customer_id,type
Upvotes: 2
Reputation: 50173
Do the aggregation :
select customer_id, max(start_date), max(end_date), type
from table t
group by customer_id, type;
Upvotes: 1