Angelo
Angelo

Reputation: 1656

Oracle Merging Two Tables With Date Gaps

This question is quite related to another one I posted recently, but I am posting a new one as this provides a little more complexity in resolving. I am seeking the help of some Oracle ninjas and rockstars out there and I feel that is a good challenge and exercise on their expertise.

Basically I have two tables, TableA and TableB.

-- For TableA
CREATE TABLE TableA
(
  ID          VARCHAR2(10),
  LOCN        VARCHAR2(10),
  START_DATE  DATE,
  END_DATE    DATE
)
STORAGE    (
            BUFFER_POOL      DEFAULT
           )
LOGGING
NOCOMPRESS
NOCACHE
NOPARALLEL
NOMONITORING
/


-- Populate TableA
INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P1',   '01',   TO_DATE('02/04/1996', 'MM/DD/YYYY'),  TO_DATE('02/22/1996', 'MM/DD/YYYY');


INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P1',   '01',   TO_DATE('02/23/1996', 'MM/DD/YYYY'),  TO_DATE('05/28/2002', 'MM/DD/YYYY');


INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P1',   '01',   TO_DATE('05/29/2002', 'MM/DD/YYYY'),  TO_DATE('05/03/2005', 'MM/DD/YYYY');


INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P1',   '01',   TO_DATE('05/04/2005', 'MM/DD/YYYY'),  TO_DATE('05/04/2005', 'MM/DD/YYYY');


INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P2',   '30',   TO_DATE('01/31/1996', 'MM/DD/YYYY'),  TO_DATE('02/06/1996', 'MM/DD/YYYY');


INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P2',   '02',   TO_DATE('02/07/1996', 'MM/DD/YYYY'),  TO_DATE('02/13/1996', 'MM/DD/YYYY');


INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P2',   '02',   TO_DATE('02/14/1996', 'MM/DD/YYYY'),  TO_DATE('01/01/2099', 'MM/DD/YYYY');


INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P3',   '03',   TO_DATE('02/07/1996', 'MM/DD/YYYY'),  TO_DATE('02/13/1996', 'MM/DD/YYYY');


INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P3',   '03',   TO_DATE('02/14/1996', 'MM/DD/YYYY'),  TO_DATE('01/01/2099', 'MM/DD/YYYY');


INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1S4',   '42',   TO_DATE('11/06/2001', 'MM/DD/YYYY'),  TO_DATE('01/01/2099', 'MM/DD/YYYY');


INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('3S4',   '42',   TO_DATE('11/06/2001', 'MM/DD/YYYY'),  TO_DATE('01/01/2099', 'MM/DD/YYYY');



-- For TableB
CREATE TABLE TableB
(
  ID           VARCHAR2(10),
  POSTING      VARCHAR2(20),
  DESCRIPTION  VARCHAR2(100),
  OTHER_ID     VARCHAR2(10),
  START_DATE   DATE,
  END_DATE     DATE
)
STORAGE    (
            BUFFER_POOL      DEFAULT
           )
LOGGING
NOCOMPRESS
NOCACHE
NOPARALLEL
NOMONITORING
/


INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('1P1', 'PROFESSOR', 'Sch 1 Quad 1 Area', 'P1', '02/04/1996', '01/01/2099');

INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('1P2', 'PROFESSOR', 'Sch 1 Quad 2 Area', 'P2', '01/31/1996', '01/01/2099');

INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('1P3', 'PROFESSOR', 'Sch 1 Quad 3 Area', 'P3', '02/05/1996', '01/01/2099');

INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('1S4', 'SUPERVISOR', 'Sch 1 CO Supervisor 4', '1S4', '02/05/1996', '03/18/2002');

INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('1S4', 'SUPERINTENDENT', 'Sch 1 CD Superintendent', '1S4', '03/19/2002', '06/09/2009');

INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('1S4', 'SUPERVISOR', 'Sch 1 CO Supervisor 4', '1S4', '06/10/2009', '01/01/2099');

INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('2S5', 'SUPERVISOR', 'Sch 2 CAO Supervisor 5', '2S5', '10/26/2002', '06/09/2009');

INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('2S5', 'SUPERINTENDENT', 'Sch 2 CAO Superintendent 5', '2S5', '06/10/2009', '07/14/2009');

INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('2S5', 'SUPERINTENDENT', 'Sch 2 CAO Superintendent 5', 'S5', '07/15/2009', '01/01/2099');

INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('3S4', 'SUPERVISOR', 'Sch 3 CO Supervisor 4', '3S4', '02/05/1996', '03/18/2002');

INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('3S4', 'SUPERINTENDENT', 'Sch 3 CD Superintendent', '3S4', '03/19/2002', '06/09/2009');

INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('3S4', 'SUPERVISOR', 'Sch 3 CO Supervisor 4', '3S4', '06/10/2009', '01/01/2099');

The process goes like this: In TableA, all records with the same ID, LOCN and having continuous START_DATE and END_DATE dates will be combined.

ID  LOCN    START_DATE  END_DATE
1P1 01      02/04/1996  05/04/2005
1P2 30      01/31/1996  02/06/1996
1P2 02      02/07/1996  01/01/2099
1P3 03      02/07/1996  01/01/2099
1S4 42      11/06/2001  01/01/2099
3S4 42      11/06/2001  01/01/2099

In TableB, all records with the same ID, POSTING, OTHER_ID and continuous START_DATE and END_DATE will also be combined. (I believe there's no data anyway that may be combined from this table).

ID  POSTING         DESCRIPTION                 OTHER_ID    START_DATE  END_DATE
1P1 PROFESSOR       Sch 1 Quad 1 Area           P1          02/04/1996  01/01/2099
1P2 PROFESSOR       Sch 1 Quad 2 Area           P2          01/31/1996  01/01/2099
1P3 PROFESSOR       Sch 1 Quad 3 Area           P3          02/05/1996  01/01/2099
1S4 SUPERVISOR      Sch 1 CO Supervisor 4       1S4         02/05/1996  03/18/2002
1S4 SUPERINTENDENT  Sch 1 CD Superintendent     1S4         03/19/2002  06/09/2009
1S4 SUPERVISOR      Sch 1 CO Supervisor 4       1S4         06/10/2009  01/01/2099
2S5 SUPERVISOR      Sch 2 CAO Supervisor 5      2S5         10/26/2002  06/09/2009
2S5 SUPERINTENDENT  Sch 2 CAO Superintendent 5  2S5         06/10/2009  07/14/2009
2S5 SUPERINTENDENT  Sch 2 CAO Superintendent 5  S5          07/15/2009  01/01/2099
3S4 SUPERVISOR      Sch 3 CO Supervisor 4       3S4         02/05/1996  03/18/2002
3S4 SUPERINTENDENT  Sch 3 CD Superintendent     3S4         03/19/2002  06/09/2009
3S4 SUPERVISOR      Sch 3 CO Supervisor 4       3S4         06/10/2009  01/01/2099

Merge the records from TableA and TableB based on the ID. The LOCN column will be added to Table B and will only be carried forward based on the date range from TableA. The resulting data should look like this:

ID  UNIT_TYPE       DESCRIPTION                 OTHER_ID    START_DATE  END_DATE    LOCN
1P1 PROFESSOR       Sch 1 Quad 1 Area           P1          02/04/1996  05/04/2005  01
1P1 PROFESSOR       Sch 1 Quad 1 Area           P1          05/05/2005  01/01/2099  {NULL}
1P2 PROFESSOR       Sch 1 Quad 2 Area           P2          01/31/1996  02/06/1996  30
1P2 PROFESSOR       Sch 1 Quad 2 Area           P2          02/07/1996  01/01/2099  02
1P3 PROFESSOR       Sch 1 Quad 3 Area           P3          02/05/1996  02/06/1996  {NULL}
1P3 PROFESSOR       Sch 1 Quad 3 Area           P3          02/07/1996  01/01/2099  03
1S4 SUPERVISOR      Sch 1 CO Supervisor 4       1S4         02/05/1996  11/05/2001  {NULL}
1S4 SUPERVISOR      Sch 1 CO Supervisor 4       1S4         11/06/2001  03/18/2002  42
1S4 SUPERINTENDENT  Sch 1 CD Superintendent     1S4         03/19/2002  06/09/2009  42
1S4 SUPERVISOR      Sch 1 CO Supervisor 4       1S4         06/10/2009  01/01/2099  42
2S5 SUPERVISOR      Sch 2 CAO Supervisor 5      2S5         10/26/2002  06/09/2009  {NULL}
2S5 SUPERINTENDENT  Sch 2 CAO Superintendent 5  2S5         06/10/2009  07/14/2009  {NULL}
2S5 SUPERINTENDENT  Sch 2 CAO Superintendent 5  S5          07/15/2009  01/01/2099  {NULL}
3S4 SUPERVISOR      Sch 3 CO Supervisor 4       3S4         02/05/1996  11/05/2001  {NULL}
3S4 SUPERVISOR      Sch 3 CO Supervisor 4       3S4         11/06/2001  03/18/2002  42
3S4 SUPERINTENDENT  Sch 3 CD Superintendent     3S4         03/19/2002  06/09/2009  42
3S4 SUPERVISOR      Sch 3 CO Supervisor 4       3S4         06/10/2009  01/01/2099  42

Would love to hear any approach feasible to solve this. Thans a lot.

Addition: This is the query I have written so far to collapse the records in TableA

SELECT ID, LOCN, TO_CHAR(MIN(START_DATE), 'MM/DD/YYYY') START_DATE, TO_CHAR(MAX(END_DATE), 'MM/DD/YYYY') END_DATE
        FROM
             (
              SELECT ID, LOCN, START_DATE, END_DATE, MAX(GRP) OVER (ORDER BY ID, START_DATE) GRP
              FROM
                  (
                   SELECT ID, LOCN,
                          CASE WHEN START_DATE - LAG(END_DATE) OVER (PARTITION BY ID, LOCN ORDER BY START_DATE ASC) <= 1 THEN
                            NULL
                          ELSE
                            ROWNUM
                          END GRP,
                          START_DATE,
                          NVL(END_DATE, SYSDATE) END_DATE
                   FROM TableA
                   ORDER BY ID ASC, START_DATE ASC
                  )
             )
        GROUP BY ID, LOCN, GRP
        ORDER BY ID ASC, START_DATE ASC;

Upvotes: 2

Views: 1913

Answers (1)

Alex Poole
Alex Poole

Reputation: 191235

Since the rockstars are busy leading their debauched (if well-earned) lifestyles and the ninjas look like they'll be busy for a while, I'll have a go...

The way you've laid it out, you want to collapse contiguous records in TableA first and use that result against a (possibly collapsed) TableB. I'm not sure doing that as a separate step is ideal for solving the overall problem, but I'll go with it for now. The general way I've found simplest to collapse the rows is something like:

select id, locn, max(start_date) as start_date, max(end_date) as end_date
from (
    select id, locn,
        case when start_date = lag_end_date  + interval '1' day then null
            else start_date end as start_date,
        case when end_date = lead_start_date - interval '1' day then null
            else end_date end as end_date,
        row_number() over (partition by id order by start_date)
            - row_number() over (partition by id, locn
                order by start_date) as chain
    from (
        select id, locn, start_date, end_date,
            lead(start_date) over (partition by id, locn
                order by start_date) as lead_start_date,
            lag(end_date) over (partition by id, locn
                order by start_date) as lag_end_date
        from TableA
    )   
)
group by id, locn, chain
order by 1, 3, 2;

ID         LOCN       START_DATE END_DATE
---------- ---------- ---------- ----------
1P1        01         02/04/1996 05/04/2005
1P2        02         02/07/1996 01/01/2099
1P2        30         01/31/1996 02/06/1996
1P3        03         02/07/1996 01/01/2099
1S4        42         11/06/2001 01/01/2099
3S4        42         11/06/2001 01/01/2099

The innermost select uses lead and lag to peek at adjacent rows (and you hinted at that in your previous question).

The next layer sets contiguous values (i.e. where one row's start date is the day after the previous row's end date) to null; if you run just that part you'll see the contiguous range start and ends emerge. It also adds a chain pseudocolumn which lets it cope with an id switching back to a previously-used locn; say having 1P2 go back to locn=30. (This is an approach I originally saw here, but see more about gaps and islands too). Without that all 'islands' of an id/locn would be treated as one block and you'd end up with overlapping date ranges.

The outer layer user min and max to remove the nulls and produce the final result.

Using that you could - if you're on 11gR2 - use a recursive CTE to join recursively to get all the combinations. This is only my second real stab at one of these so others may well point out flaws or improvements, if they can tear themselves away from their M&Ms... might give you some pointers though.

with a as (
    select id, locn, max(start_date) as start_date, max(end_date) as end_date
    from (
        select id, locn,
            case when start_date = lag_end_date  + interval '1' day then null
                else start_date end as start_date,
            case when end_date = lead_start_date - interval '1' day then null
                else end_date end as end_date,
            row_number() over (partition by id order by start_date)
                - row_number() over (partition by id, locn
                    order by start_date) as chain
        from (
            select id, locn, start_date, end_date,
                lead(start_date) over (partition by id, locn
                    order by start_date) as lead_start_date,
                lag(end_date) over (partition by id, locn
                    order by start_date) as lag_end_date
            from TableA
        )
    )
    group by id, locn, chain
),
b as (
    select id, posting, description, other_id, start_date, end_date,
        row_number() over (partition by id, posting, description,
            other_id order by start_date, end_date) as rn
    from TableB
),
r (id, posting, description, other_id, rn, start_date, end_date, locn) as (
    select b.id, b.posting, b.description, b.other_id, b.rn,
        b.start_date,
        case
            when not (a.start_date > b.end_date or a.end_date < b.start_date)
                and a.start_date <= b.end_date and a.end_date < b.end_date
                then a.end_date
            when not (a.start_date > b.end_date or a.end_date < b.start_date)
                and a.start_date <= b.end_date and a.start_date > b.start_date
                then a.start_date - interval '1' day
            else b.end_date
        end as end_date,
        case
            when a.start_date <= b.start_date and a.end_date >= b.start_date
                then a.locn
        end
    from b
    left join (
        select id, locn, start_date, end_date,
            row_number() over (partition by id order by start_date) as rn
        from a
    ) a on a.id = b.id
        and a.rn = 1
    union all
    select b.id, b.posting, b.description, b.other_id, b.rn,
        case
            when a.start_date is null then r.end_date + interval '1' day
            else a.start_date
        end as start_date,
        case
            when a.start_date is null then b.end_date
            when not (a.start_date > r.end_date or a.end_date < r.start_date)
                then least(a.end_date, b.end_date)
            when a.end_date < b.end_date then a.start_date - interval '1' day
            else b.end_date
        end as end_date,
        a.locn
    from b
    join r on r.id = b.id
        and r.posting = b.posting
        and r.description = b.description
        and r.other_id = b.other_id
        and r.rn = b.rn
        and r.start_date = b.start_date
        and r.end_date < b.end_date
    left join a on a.id = r.id
        and a.start_date > r.end_date
) 
select id, posting as unit_type, description, other_id,
    start_date, end_date, locn
from r
order by id, start_date;

This gets the result you wanted I believe:

ID         UNIT_TYPE            DESCRIPTION                    OTHER_ID   START_DATE END_DATE   LOCN
---------- -------------------- ------------------------------ ---------- ---------- ---------- ----------
1P1        PROFESSOR            Sch 1 Quad 1 Area              P1         02/04/1996 05/04/2005 01
1P1        PROFESSOR            Sch 1 Quad 1 Area              P1         05/05/2005 01/01/2099
1P2        PROFESSOR            Sch 1 Quad 2 Area              P2         01/31/1996 02/06/1996 30
1P2        PROFESSOR            Sch 1 Quad 2 Area              P2         02/07/1996 01/01/2099 02
1P3        PROFESSOR            Sch 1 Quad 3 Area              P3         02/05/1996 02/06/1996
1P3        PROFESSOR            Sch 1 Quad 3 Area              P3         02/07/1996 01/01/2099 03
1S4        SUPERVISOR           Sch 1 CO Supervisor 4          1S4        02/05/1996 11/05/2001
1S4        SUPERVISOR           Sch 1 CO Supervisor 4          1S4        11/06/2001 03/18/2002 42
1S4        SUPERINTENDENT       Sch 1 CD Superintendent        1S4        03/19/2002 06/09/2009 42
1S4        SUPERVISOR           Sch 1 CO Supervisor 4          1S4        06/10/2009 01/01/2099 42
2S5        SUPERVISOR           Sch 2 CAO Supervisor 5         2S5        10/26/2002 06/09/2009
2S5        SUPERINTENDENT       Sch 2 CAO Superintendent 5     2S5        06/10/2009 07/14/2009
2S5        SUPERINTENDENT       Sch 2 CAO Superintendent 5     S5         07/15/2009 01/01/2099
3S4        SUPERVISOR           Sch 3 CO Supervisor 4          3S4        02/05/1996 11/05/2001
3S4        SUPERVISOR           Sch 3 CO Supervisor 4          3S4        11/06/2001 03/18/2002 42
3S4        SUPERINTENDENT       Sch 3 CD Superintendent        3S4        03/19/2002 06/09/2009 42
3S4        SUPERVISOR           Sch 3 CO Supervisor 4          3S4        06/10/2009 01/01/2099 42

17 rows selected.

This is using three CTEs. a is as described above, a collapsed version of TableA. b is TableB but with the addition of a row number column I think I need later to keep records in step during the recursion. r is where the fun starts.

The first part of r generates the initial data for each TableB entry, with matching values from TableA if appropriate - but only from the first matching record if there might be more than one. The tricky bit here is figuring out what end_date should be. If there is no overlapping TableA record at all then it can just be the TableB end date; if there is but it starts after the TableB record then this needs to end immediately before the TableA kicks in. Otherwise it depends if the TableA record ends before or after the TableB one.

Running just that part:

with a as (...), b as (...)
select b.id, b.posting, b.description, b.other_id, b.rn,
    b.start_date,
    case
        when not (a.start_date > b.end_date or a.end_date < b.start_date)
            and a.start_date <= b.end_date and a.end_date < b.end_date
            then a.end_date
        when not (a.start_date > b.end_date or a.end_date < b.start_date)
            and a.start_date <= b.end_date and a.start_date > b.start_date
            then a.start_date - interval '1' day
        else b.end_date
    end as end_date,
    case
        when a.start_date <= b.start_date and a.end_date >= b.start_date
            then a.locn
    end
from b
left join (
    select id, locn, start_date, end_date,
        row_number() over (partition by id order by start_date) as rn
    from a
) a on a.id = b.id
    and a.rn = 1
order by id, start_date;

... gives this (description suppressed for readability):

ID         UNIT_TYPE            OTHER_ID   START_DATE END_DATE   LOCN
---------- -------------------- ---------- ---------- ---------- ----------
1P1        PROFESSOR            P1         02/04/1996 05/04/2005 01
1P2        PROFESSOR            P2         01/31/1996 02/06/1996 30
1P3        PROFESSOR            P3         02/05/1996 02/06/1996
1S4        SUPERVISOR           1S4        02/05/1996 11/05/2001
1S4        SUPERINTENDENT       1S4        03/19/2002 06/09/2009 42
1S4        SUPERVISOR           1S4        06/10/2009 01/01/2099 42
2S5        SUPERVISOR           2S5        10/26/2002 06/09/2009
2S5        SUPERINTENDENT       2S5        06/10/2009 07/14/2009
2S5        SUPERINTENDENT       S5         07/15/2009 01/01/2099
3S4        SUPERVISOR           3S4        02/05/1996 11/05/2001
3S4        SUPERINTENDENT       3S4        03/19/2002 06/09/2009 42
3S4        SUPERVISOR           3S4        06/10/2009 01/01/2099 42

12 rows selected.

For IP3, there is initially no matching TableA record, but notice that end_date is set to the day before the one that does match later starts.

The second part of r, the union all, is the recursive part. For each TableB record, it's joining back to itself looking for a record where the generated end_date is earlier than the original's, as is the case for IP3, which means there is a period of time still to fill in. It then looks for a suitable TableA record and generates suitable values for start_date and end_date, again depending on whether - and how - the records overlap. It's entirely possible I've missed some edge cases in here.

You mentioned that there might be contiguous ranges to collapse for TableB too, and you could give it a similar treatment to the one I've shown for TableA. I'm not sure though that doing it there is necessarily the best or clearest point, even with just one table needing it; I've only really done it there because that was how you described the process.

If you modify the recursive CTE to be against the base tables (possibly simplifying it slightly in the process) you could apply a gaps-and-islands approach to that result set rather than the individual tables, so it doesn't matter which table the gaps are caused by.

Upvotes: 4

Related Questions