LostReality
LostReality

Reputation: 687

SQL Merge/Group lines into a single one

I would like to know what is the most optimized SQL for this problem. I have to apply a SQL query on a table which contains invalid data in order to correct all those invalid data. The table is structured like below :

TABLE(customer_id, start_date, end_date, type)

Currently, the table can contain many lines for a given tuple (customer_id,type). My query needs to "merge" all line belonging to a group into a single one and keep the most recent date for start date and the oldest for end date :

Cust1;01/01/2012;01/01/2020;1
Cust1;01/01/2010;01/01/2024;1

should transform into a single line

Cust1;01/01/2012;01/01/2024;1

I need to correct the data not only select it : delete lines if there is more than 1 and retrieve datas from each ! I hope my explanations are clear enough ! I work with Oracle DBMS

Thank you,

Upvotes: 1

Views: 74

Answers (3)

Marmite Bomber
Marmite Bomber

Reputation: 21170

If you have only small number of duplicated row an alternative approch using in place update / delete will be prefereble.

So first check the number of duplicated rows

with clean as (
select CUSTOMER_ID, TYPE, max(start_date) start_date_clean, max(end_date)  end_date_clean
from tab group by CUSTOMER_ID, TYPE)
select tab.*, start_date_clean, end_date_clean
from tab join clean on tab.CUSTOMER_ID = clean.CUSTOMER_ID and tab.TYPE = clean.TYPE
where  start_date != start_date_clean or  end_date != end_date_clean
;

This query will return all rows that will be processed, i.e. either start or end date are not correct.

If this number is large - go the way as proposed by other answer - copy the table and replace the original table with the copy.

If the **number is small*, go the update / delete way:

update tab a
set a.START_DATE = (select max(b.START_DATE) from tab b where a.customer_id = b.customer_id and a.type = b.type),
a.END_DATE = (select max(b.END_DATE) from tab b where a.customer_id = b.customer_id and a.type = b.type)
where (a.customer_id, a.type) in 
( 
select tab.CUSTOMER_ID, tab.TYPE 
from tab join 
(select CUSTOMER_ID, TYPE, max(start_date) start_date_clean, max(end_date)  end_date_clean
from tab group by CUSTOMER_ID, TYPE) clean 
on tab.CUSTOMER_ID = clean.CUSTOMER_ID and tab.TYPE = clean.TYPE
where  start_date != start_date_clean or  end_date != end_date_clean);

This updates in all affected rows the start and end date to the correct vales.

example

CUSTOMER_ID START_DATE          END_DATE                  TYPE
----------- ------------------- ------------------- ----------
          1 01-01-2013 00:00:00 01-01-2016 00:00:00          1 
          1 01-01-2012 00:00:00 01-01-2018 00:00:00          1 
          1 01-01-2010 00:00:00 01-01-2017 00:00:00          1 
          2 01-01-2010 00:00:00 01-01-2018 00:00:00          1 
          3 01-01-2010 00:00:00 01-01-2018 00:00:00          1

is updated to

CUSTOMER_ID START_DATE          END_DATE                  TYPE
----------- ------------------- ------------------- ----------
          1 01-01-2013 00:00:00 01-01-2018 00:00:00          1 
          1 01-01-2013 00:00:00 01-01-2018 00:00:00          1 
          1 01-01-2013 00:00:00 01-01-2018 00:00:00          1 
          2 01-01-2010 00:00:00 01-01-2018 00:00:00          1 
          3 01-01-2010 00:00:00 01-01-2018 00:00:00          1 

In the next step the duplicated rows must be deleted. This makes the next delete which user the ROW_NUMBER to identify the duplicates:

delete from tab where rowid in 
(select RID from (
  select rowid rid,
  row_number() over (partition by CUSTOMER_ID, TYPE order by null) rn
  from tab) 
where rn > 1)
;

What you see - the brute force copy approach is simple in queries, but leaves the table offline for some time. You need twice as space to perform it and it will take some time.

The update approach is more complicated, but goes without maintainance window and is quickly done.

Upvotes: 1

Zaynul Abadin Tuhin
Zaynul Abadin Tuhin

Reputation: 32021

use max() function

    select customer_id,type, max(start_date),max(end_date)
    from t1
    group by customer_id,type

i think you want create another table by using these data

 create table test_t as
 select customer_id,type, max(start_date),max(end_date)
        from t1
        group by customer_id,type 

Upvotes: 2

Yogesh Sharma
Yogesh Sharma

Reputation: 50173

Do the aggregation :

select customer_id, max(start_date), max(end_date), type
from table t
group by customer_id, type;

Upvotes: 1

Related Questions