Reputation: 1656
This question is quite related to another one I posted recently, but I am posting a new one as this provides a little more complexity in resolving. I am seeking the help of some Oracle ninjas and rockstars out there and I feel that is a good challenge and exercise on their expertise.
Basically I have two tables, TableA and TableB.
-- For TableA
CREATE TABLE TableA
(
ID VARCHAR2(10),
LOCN VARCHAR2(10),
START_DATE DATE,
END_DATE DATE
)
STORAGE (
BUFFER_POOL DEFAULT
)
LOGGING
NOCOMPRESS
NOCACHE
NOPARALLEL
NOMONITORING
/
-- Populate TableA
INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P1', '01', TO_DATE('02/04/1996', 'MM/DD/YYYY'), TO_DATE('02/22/1996', 'MM/DD/YYYY');
INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P1', '01', TO_DATE('02/23/1996', 'MM/DD/YYYY'), TO_DATE('05/28/2002', 'MM/DD/YYYY');
INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P1', '01', TO_DATE('05/29/2002', 'MM/DD/YYYY'), TO_DATE('05/03/2005', 'MM/DD/YYYY');
INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P1', '01', TO_DATE('05/04/2005', 'MM/DD/YYYY'), TO_DATE('05/04/2005', 'MM/DD/YYYY');
INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P2', '30', TO_DATE('01/31/1996', 'MM/DD/YYYY'), TO_DATE('02/06/1996', 'MM/DD/YYYY');
INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P2', '02', TO_DATE('02/07/1996', 'MM/DD/YYYY'), TO_DATE('02/13/1996', 'MM/DD/YYYY');
INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P2', '02', TO_DATE('02/14/1996', 'MM/DD/YYYY'), TO_DATE('01/01/2099', 'MM/DD/YYYY');
INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P3', '03', TO_DATE('02/07/1996', 'MM/DD/YYYY'), TO_DATE('02/13/1996', 'MM/DD/YYYY');
INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1P3', '03', TO_DATE('02/14/1996', 'MM/DD/YYYY'), TO_DATE('01/01/2099', 'MM/DD/YYYY');
INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('1S4', '42', TO_DATE('11/06/2001', 'MM/DD/YYYY'), TO_DATE('01/01/2099', 'MM/DD/YYYY');
INSERT INTO TableA(ID, LOCN, START_DATE, END_DATE)
VALUES('3S4', '42', TO_DATE('11/06/2001', 'MM/DD/YYYY'), TO_DATE('01/01/2099', 'MM/DD/YYYY');
-- For TableB
CREATE TABLE TableB
(
ID VARCHAR2(10),
POSTING VARCHAR2(20),
DESCRIPTION VARCHAR2(100),
OTHER_ID VARCHAR2(10),
START_DATE DATE,
END_DATE DATE
)
STORAGE (
BUFFER_POOL DEFAULT
)
LOGGING
NOCOMPRESS
NOCACHE
NOPARALLEL
NOMONITORING
/
INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('1P1', 'PROFESSOR', 'Sch 1 Quad 1 Area', 'P1', '02/04/1996', '01/01/2099');
INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('1P2', 'PROFESSOR', 'Sch 1 Quad 2 Area', 'P2', '01/31/1996', '01/01/2099');
INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('1P3', 'PROFESSOR', 'Sch 1 Quad 3 Area', 'P3', '02/05/1996', '01/01/2099');
INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('1S4', 'SUPERVISOR', 'Sch 1 CO Supervisor 4', '1S4', '02/05/1996', '03/18/2002');
INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('1S4', 'SUPERINTENDENT', 'Sch 1 CD Superintendent', '1S4', '03/19/2002', '06/09/2009');
INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('1S4', 'SUPERVISOR', 'Sch 1 CO Supervisor 4', '1S4', '06/10/2009', '01/01/2099');
INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('2S5', 'SUPERVISOR', 'Sch 2 CAO Supervisor 5', '2S5', '10/26/2002', '06/09/2009');
INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('2S5', 'SUPERINTENDENT', 'Sch 2 CAO Superintendent 5', '2S5', '06/10/2009', '07/14/2009');
INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('2S5', 'SUPERINTENDENT', 'Sch 2 CAO Superintendent 5', 'S5', '07/15/2009', '01/01/2099');
INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('3S4', 'SUPERVISOR', 'Sch 3 CO Supervisor 4', '3S4', '02/05/1996', '03/18/2002');
INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('3S4', 'SUPERINTENDENT', 'Sch 3 CD Superintendent', '3S4', '03/19/2002', '06/09/2009');
INSERT INTO TableB(ID, POSTING, DESCRIPTION, OTHER_ID, START_DATE, END_DATE)
VALUES('3S4', 'SUPERVISOR', 'Sch 3 CO Supervisor 4', '3S4', '06/10/2009', '01/01/2099');
The process goes like this: In TableA, all records with the same ID, LOCN and having continuous START_DATE and END_DATE dates will be combined.
ID LOCN START_DATE END_DATE
1P1 01 02/04/1996 05/04/2005
1P2 30 01/31/1996 02/06/1996
1P2 02 02/07/1996 01/01/2099
1P3 03 02/07/1996 01/01/2099
1S4 42 11/06/2001 01/01/2099
3S4 42 11/06/2001 01/01/2099
In TableB, all records with the same ID, POSTING, OTHER_ID and continuous START_DATE and END_DATE will also be combined. (I believe there's no data anyway that may be combined from this table).
ID POSTING DESCRIPTION OTHER_ID START_DATE END_DATE
1P1 PROFESSOR Sch 1 Quad 1 Area P1 02/04/1996 01/01/2099
1P2 PROFESSOR Sch 1 Quad 2 Area P2 01/31/1996 01/01/2099
1P3 PROFESSOR Sch 1 Quad 3 Area P3 02/05/1996 01/01/2099
1S4 SUPERVISOR Sch 1 CO Supervisor 4 1S4 02/05/1996 03/18/2002
1S4 SUPERINTENDENT Sch 1 CD Superintendent 1S4 03/19/2002 06/09/2009
1S4 SUPERVISOR Sch 1 CO Supervisor 4 1S4 06/10/2009 01/01/2099
2S5 SUPERVISOR Sch 2 CAO Supervisor 5 2S5 10/26/2002 06/09/2009
2S5 SUPERINTENDENT Sch 2 CAO Superintendent 5 2S5 06/10/2009 07/14/2009
2S5 SUPERINTENDENT Sch 2 CAO Superintendent 5 S5 07/15/2009 01/01/2099
3S4 SUPERVISOR Sch 3 CO Supervisor 4 3S4 02/05/1996 03/18/2002
3S4 SUPERINTENDENT Sch 3 CD Superintendent 3S4 03/19/2002 06/09/2009
3S4 SUPERVISOR Sch 3 CO Supervisor 4 3S4 06/10/2009 01/01/2099
Merge the records from TableA and TableB based on the ID. The LOCN column will be added to Table B and will only be carried forward based on the date range from TableA. The resulting data should look like this:
ID UNIT_TYPE DESCRIPTION OTHER_ID START_DATE END_DATE LOCN
1P1 PROFESSOR Sch 1 Quad 1 Area P1 02/04/1996 05/04/2005 01
1P1 PROFESSOR Sch 1 Quad 1 Area P1 05/05/2005 01/01/2099 {NULL}
1P2 PROFESSOR Sch 1 Quad 2 Area P2 01/31/1996 02/06/1996 30
1P2 PROFESSOR Sch 1 Quad 2 Area P2 02/07/1996 01/01/2099 02
1P3 PROFESSOR Sch 1 Quad 3 Area P3 02/05/1996 02/06/1996 {NULL}
1P3 PROFESSOR Sch 1 Quad 3 Area P3 02/07/1996 01/01/2099 03
1S4 SUPERVISOR Sch 1 CO Supervisor 4 1S4 02/05/1996 11/05/2001 {NULL}
1S4 SUPERVISOR Sch 1 CO Supervisor 4 1S4 11/06/2001 03/18/2002 42
1S4 SUPERINTENDENT Sch 1 CD Superintendent 1S4 03/19/2002 06/09/2009 42
1S4 SUPERVISOR Sch 1 CO Supervisor 4 1S4 06/10/2009 01/01/2099 42
2S5 SUPERVISOR Sch 2 CAO Supervisor 5 2S5 10/26/2002 06/09/2009 {NULL}
2S5 SUPERINTENDENT Sch 2 CAO Superintendent 5 2S5 06/10/2009 07/14/2009 {NULL}
2S5 SUPERINTENDENT Sch 2 CAO Superintendent 5 S5 07/15/2009 01/01/2099 {NULL}
3S4 SUPERVISOR Sch 3 CO Supervisor 4 3S4 02/05/1996 11/05/2001 {NULL}
3S4 SUPERVISOR Sch 3 CO Supervisor 4 3S4 11/06/2001 03/18/2002 42
3S4 SUPERINTENDENT Sch 3 CD Superintendent 3S4 03/19/2002 06/09/2009 42
3S4 SUPERVISOR Sch 3 CO Supervisor 4 3S4 06/10/2009 01/01/2099 42
Would love to hear any approach feasible to solve this. Thans a lot.
Addition: This is the query I have written so far to collapse the records in TableA
SELECT ID, LOCN, TO_CHAR(MIN(START_DATE), 'MM/DD/YYYY') START_DATE, TO_CHAR(MAX(END_DATE), 'MM/DD/YYYY') END_DATE
FROM
(
SELECT ID, LOCN, START_DATE, END_DATE, MAX(GRP) OVER (ORDER BY ID, START_DATE) GRP
FROM
(
SELECT ID, LOCN,
CASE WHEN START_DATE - LAG(END_DATE) OVER (PARTITION BY ID, LOCN ORDER BY START_DATE ASC) <= 1 THEN
NULL
ELSE
ROWNUM
END GRP,
START_DATE,
NVL(END_DATE, SYSDATE) END_DATE
FROM TableA
ORDER BY ID ASC, START_DATE ASC
)
)
GROUP BY ID, LOCN, GRP
ORDER BY ID ASC, START_DATE ASC;
Upvotes: 2
Views: 1913
Reputation: 191235
Since the rockstars are busy leading their debauched (if well-earned) lifestyles and the ninjas look like they'll be busy for a while, I'll have a go...
The way you've laid it out, you want to collapse contiguous records in TableA
first and use that result against a (possibly collapsed) TableB
. I'm not sure doing that as a separate step is ideal for solving the overall problem, but I'll go with it for now. The general way I've found simplest to collapse the rows is something like:
select id, locn, max(start_date) as start_date, max(end_date) as end_date
from (
select id, locn,
case when start_date = lag_end_date + interval '1' day then null
else start_date end as start_date,
case when end_date = lead_start_date - interval '1' day then null
else end_date end as end_date,
row_number() over (partition by id order by start_date)
- row_number() over (partition by id, locn
order by start_date) as chain
from (
select id, locn, start_date, end_date,
lead(start_date) over (partition by id, locn
order by start_date) as lead_start_date,
lag(end_date) over (partition by id, locn
order by start_date) as lag_end_date
from TableA
)
)
group by id, locn, chain
order by 1, 3, 2;
ID LOCN START_DATE END_DATE
---------- ---------- ---------- ----------
1P1 01 02/04/1996 05/04/2005
1P2 02 02/07/1996 01/01/2099
1P2 30 01/31/1996 02/06/1996
1P3 03 02/07/1996 01/01/2099
1S4 42 11/06/2001 01/01/2099
3S4 42 11/06/2001 01/01/2099
The innermost select
uses lead
and lag
to peek at adjacent rows (and you hinted at that in your previous question).
The next layer sets contiguous values (i.e. where one row's start date is the day after the previous row's end date) to null; if you run just that part you'll see the contiguous range start and ends emerge. It also adds a chain
pseudocolumn which lets it cope with an id
switching back to a previously-used locn
; say having 1P2
go back to locn=30
. (This is an approach I originally saw here, but see more about gaps and islands too). Without that all 'islands' of an id/locn
would be treated as one block and you'd end up with overlapping date ranges.
The outer layer user min
and max
to remove the nulls and produce the final result.
Using that you could - if you're on 11gR2 - use a recursive CTE to join recursively to get all the combinations. This is only my second real stab at one of these so others may well point out flaws or improvements, if they can tear themselves away from their M&Ms... might give you some pointers though.
with a as (
select id, locn, max(start_date) as start_date, max(end_date) as end_date
from (
select id, locn,
case when start_date = lag_end_date + interval '1' day then null
else start_date end as start_date,
case when end_date = lead_start_date - interval '1' day then null
else end_date end as end_date,
row_number() over (partition by id order by start_date)
- row_number() over (partition by id, locn
order by start_date) as chain
from (
select id, locn, start_date, end_date,
lead(start_date) over (partition by id, locn
order by start_date) as lead_start_date,
lag(end_date) over (partition by id, locn
order by start_date) as lag_end_date
from TableA
)
)
group by id, locn, chain
),
b as (
select id, posting, description, other_id, start_date, end_date,
row_number() over (partition by id, posting, description,
other_id order by start_date, end_date) as rn
from TableB
),
r (id, posting, description, other_id, rn, start_date, end_date, locn) as (
select b.id, b.posting, b.description, b.other_id, b.rn,
b.start_date,
case
when not (a.start_date > b.end_date or a.end_date < b.start_date)
and a.start_date <= b.end_date and a.end_date < b.end_date
then a.end_date
when not (a.start_date > b.end_date or a.end_date < b.start_date)
and a.start_date <= b.end_date and a.start_date > b.start_date
then a.start_date - interval '1' day
else b.end_date
end as end_date,
case
when a.start_date <= b.start_date and a.end_date >= b.start_date
then a.locn
end
from b
left join (
select id, locn, start_date, end_date,
row_number() over (partition by id order by start_date) as rn
from a
) a on a.id = b.id
and a.rn = 1
union all
select b.id, b.posting, b.description, b.other_id, b.rn,
case
when a.start_date is null then r.end_date + interval '1' day
else a.start_date
end as start_date,
case
when a.start_date is null then b.end_date
when not (a.start_date > r.end_date or a.end_date < r.start_date)
then least(a.end_date, b.end_date)
when a.end_date < b.end_date then a.start_date - interval '1' day
else b.end_date
end as end_date,
a.locn
from b
join r on r.id = b.id
and r.posting = b.posting
and r.description = b.description
and r.other_id = b.other_id
and r.rn = b.rn
and r.start_date = b.start_date
and r.end_date < b.end_date
left join a on a.id = r.id
and a.start_date > r.end_date
)
select id, posting as unit_type, description, other_id,
start_date, end_date, locn
from r
order by id, start_date;
This gets the result you wanted I believe:
ID UNIT_TYPE DESCRIPTION OTHER_ID START_DATE END_DATE LOCN
---------- -------------------- ------------------------------ ---------- ---------- ---------- ----------
1P1 PROFESSOR Sch 1 Quad 1 Area P1 02/04/1996 05/04/2005 01
1P1 PROFESSOR Sch 1 Quad 1 Area P1 05/05/2005 01/01/2099
1P2 PROFESSOR Sch 1 Quad 2 Area P2 01/31/1996 02/06/1996 30
1P2 PROFESSOR Sch 1 Quad 2 Area P2 02/07/1996 01/01/2099 02
1P3 PROFESSOR Sch 1 Quad 3 Area P3 02/05/1996 02/06/1996
1P3 PROFESSOR Sch 1 Quad 3 Area P3 02/07/1996 01/01/2099 03
1S4 SUPERVISOR Sch 1 CO Supervisor 4 1S4 02/05/1996 11/05/2001
1S4 SUPERVISOR Sch 1 CO Supervisor 4 1S4 11/06/2001 03/18/2002 42
1S4 SUPERINTENDENT Sch 1 CD Superintendent 1S4 03/19/2002 06/09/2009 42
1S4 SUPERVISOR Sch 1 CO Supervisor 4 1S4 06/10/2009 01/01/2099 42
2S5 SUPERVISOR Sch 2 CAO Supervisor 5 2S5 10/26/2002 06/09/2009
2S5 SUPERINTENDENT Sch 2 CAO Superintendent 5 2S5 06/10/2009 07/14/2009
2S5 SUPERINTENDENT Sch 2 CAO Superintendent 5 S5 07/15/2009 01/01/2099
3S4 SUPERVISOR Sch 3 CO Supervisor 4 3S4 02/05/1996 11/05/2001
3S4 SUPERVISOR Sch 3 CO Supervisor 4 3S4 11/06/2001 03/18/2002 42
3S4 SUPERINTENDENT Sch 3 CD Superintendent 3S4 03/19/2002 06/09/2009 42
3S4 SUPERVISOR Sch 3 CO Supervisor 4 3S4 06/10/2009 01/01/2099 42
17 rows selected.
This is using three CTEs. a
is as described above, a collapsed version of TableA
. b
is TableB
but with the addition of a row number column I think I need later to keep records in step during the recursion. r
is where the fun starts.
The first part of r
generates the initial data for each TableB
entry, with matching values from TableA
if appropriate - but only from the first matching record if there might be more than one. The tricky bit here is figuring out what end_date
should be. If there is no overlapping TableA
record at all then it can just be the TableB
end date; if there is but it starts after the TableB
record then this needs to end immediately before the TableA
kicks in. Otherwise it depends if the TableA
record ends before or after the TableB
one.
Running just that part:
with a as (...), b as (...)
select b.id, b.posting, b.description, b.other_id, b.rn,
b.start_date,
case
when not (a.start_date > b.end_date or a.end_date < b.start_date)
and a.start_date <= b.end_date and a.end_date < b.end_date
then a.end_date
when not (a.start_date > b.end_date or a.end_date < b.start_date)
and a.start_date <= b.end_date and a.start_date > b.start_date
then a.start_date - interval '1' day
else b.end_date
end as end_date,
case
when a.start_date <= b.start_date and a.end_date >= b.start_date
then a.locn
end
from b
left join (
select id, locn, start_date, end_date,
row_number() over (partition by id order by start_date) as rn
from a
) a on a.id = b.id
and a.rn = 1
order by id, start_date;
... gives this (description suppressed for readability):
ID UNIT_TYPE OTHER_ID START_DATE END_DATE LOCN
---------- -------------------- ---------- ---------- ---------- ----------
1P1 PROFESSOR P1 02/04/1996 05/04/2005 01
1P2 PROFESSOR P2 01/31/1996 02/06/1996 30
1P3 PROFESSOR P3 02/05/1996 02/06/1996
1S4 SUPERVISOR 1S4 02/05/1996 11/05/2001
1S4 SUPERINTENDENT 1S4 03/19/2002 06/09/2009 42
1S4 SUPERVISOR 1S4 06/10/2009 01/01/2099 42
2S5 SUPERVISOR 2S5 10/26/2002 06/09/2009
2S5 SUPERINTENDENT 2S5 06/10/2009 07/14/2009
2S5 SUPERINTENDENT S5 07/15/2009 01/01/2099
3S4 SUPERVISOR 3S4 02/05/1996 11/05/2001
3S4 SUPERINTENDENT 3S4 03/19/2002 06/09/2009 42
3S4 SUPERVISOR 3S4 06/10/2009 01/01/2099 42
12 rows selected.
For IP3
, there is initially no matching TableA
record, but notice that end_date
is set to the day before the one that does match later starts.
The second part of r
, the union all
, is the recursive part. For each TableB
record, it's joining back to itself looking for a record where the generated end_date
is earlier than the original's, as is the case for IP3
, which means there is a period of time still to fill in. It then looks for a suitable TableA
record and generates suitable values for start_date
and end_date
, again depending on whether - and how - the records overlap. It's entirely possible I've missed some edge cases in here.
You mentioned that there might be contiguous ranges to collapse for TableB
too, and you could give it a similar treatment to the one I've shown for TableA
. I'm not sure though that doing it there is necessarily the best or clearest point, even with just one table needing it; I've only really done it there because that was how you described the process.
If you modify the recursive CTE to be against the base tables (possibly simplifying it slightly in the process) you could apply a gaps-and-islands approach to that result set rather than the individual tables, so it doesn't matter which table the gaps are caused by.
Upvotes: 4