Reputation: 741
I'm stuck with the following problem and need help:
An object has properties that are calculated every day.
They are stored in a key-value historical table.
Property is mistakenly stored even if it was not changed.
I need a query that will group this data set by "actual values":
Here is a dataset example.
with obj_val_hist as
(
select 123 obj_id, 'k_1' key, 'A' value_, to_date('01.01.2021', 'DD.MM.YYYY') start_dt, to_date('01.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'A' value_, to_date('02.01.2021', 'DD.MM.YYYY') start_dt, to_date('02.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'A' value_, to_date('03.01.2021', 'DD.MM.YYYY') start_dt, to_date('03.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'B' value_, to_date('04.01.2021', 'DD.MM.YYYY') start_dt, to_date('04.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'B' value_, to_date('05.01.2021', 'DD.MM.YYYY') start_dt, to_date('05.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'B' value_, to_date('06.01.2021', 'DD.MM.YYYY') start_dt, to_date('06.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'A' value_, to_date('07.01.2021', 'DD.MM.YYYY') start_dt, to_date('07.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'A' value_, to_date('08.01.2021', 'DD.MM.YYYY') start_dt, to_date('08.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'A' value_, to_date('09.01.2021', 'DD.MM.YYYY') start_dt, to_date('09.01.2021', 'DD.MM.YYYY') end_dt from dual
)
select * from obj_val_hist where obj_id = 123;
Data set:
obj_id | key | value | start_date | end_date |
---|---|---|---|---|
123 | k_1 | A | 01.01.2021 | 01.01.2021 |
123 | k_1 | A | 02.01.2021 | 02.01.2021 |
123 | k_1 | A | 03.01.2021 | 03.01.2021 |
123 | k_1 | B | 04.01.2021 | 04.01.2021 |
123 | k_1 | B | 05.01.2021 | 05.01.2021 |
123 | k_1 | B | 06.01.2021 | 06.01.2021 |
123 | k_1 | A | 07.01.2021 | 07.01.2021 |
123 | k_1 | A | 08.01.2021 | 08.01.2021 |
123 | k_1 | A | 09.01.2021 | 09.01.2021 |
Expected result:
obj_id | key | value | start_date | end_date |
---|---|---|---|---|
123 | k_1 | A | 01.01.2021 | 03.01.2021 |
123 | k_1 | B | 04.01.2021 | 06.01.2021 |
123 | k_1 | A | 07.01.2021 | 09.01.2021 |
This table contains values for million objects.
It is queried by obj_id and has an index on it.
Performance is a key point so using stored functions is most probably not an option.
This query will be a small part of a big view that is used by an external system.
I expected that there should be an analytic function suited for such a problem.
Something like dense_rank but with the possibility to order by one column (start_dt) but increase value when another column (value_) gets a different value.
But I didn't find one.
Upvotes: 0
Views: 98
Reputation: 6751
You may use match_recognize
for this, which can also handle gaps in dates and is quite efficient and natural to read:
create table t ( obj_id , key_ , value_ , start_date , end_date ) as select 123, 'k_1', 'A', to_date('01.01.2021', 'dd.mm.yyyy'), to_date('01.01.2021', 'dd.mm.yyyy') from dual union all select 123, 'k_1', 'A', to_date('02.01.2021', 'dd.mm.yyyy'), to_date('02.01.2021', 'dd.mm.yyyy') from dual union all select 123, 'k_1', 'A', to_date('03.01.2021', 'dd.mm.yyyy'), to_date('03.01.2021', 'dd.mm.yyyy') from dual union all select 123, 'k_1', 'B', to_date('04.01.2021', 'dd.mm.yyyy'), to_date('04.01.2021', 'dd.mm.yyyy') from dual union all select 123, 'k_1', 'B', to_date('05.01.2021', 'dd.mm.yyyy'), to_date('05.01.2021', 'dd.mm.yyyy') from dual union all select 123, 'k_1', 'B', to_date('06.01.2021', 'dd.mm.yyyy'), to_date('06.01.2021', 'dd.mm.yyyy') from dual union all select 123, 'k_1', 'A', to_date('07.01.2021', 'dd.mm.yyyy'), to_date('07.01.2021', 'dd.mm.yyyy') from dual union all select 123, 'k_1', 'A', to_date('08.01.2021', 'dd.mm.yyyy'), to_date('08.01.2021', 'dd.mm.yyyy') from dual union all select 123, 'k_1', 'A', to_date('09.01.2021', 'dd.mm.yyyy'), to_date('09.01.2021', 'dd.mm.yyyy') from dual union all /*Let's skip 10.01*/ select 123, 'k_1', 'A', to_date('11.01.2021', 'dd.mm.yyyy'), to_date('11.01.2021', 'dd.mm.yyyy') from dual union all /*And extent validity period for some record*/ select 123, 'k_1', 'A', to_date('12.01.2021', 'dd.mm.yyyy'), to_date('13.01.2021', 'dd.mm.yyyy') from dual union all select 123, 'k_1', 'A', to_date('14.01.2021', 'dd.mm.yyyy'), to_date('14.01.2021', 'dd.mm.yyyy') from dual
select * from t match_recognize ( /*For each ID and KEY*/ partition by obj_id, key_ order by start_date asc /*Output attributes*/ measures /*start_date of the first row in match group*/ final first(start_date) as min_start_date, /*end_date of the last row in match group*/ final last(end_date) as max_end_date, /*value itself as it is constant for match group*/ value_ as val /*First row and any consequtive matches*/ pattern (init A*) define /*Consequtive are the rows which have the same value in value_ field and start_date of the next row is not farther than 1 day from end_date of the previous row */ A as prev(value_) = value_ and prev(end_date) + 1 = start_date )
OBJ_ID | KEY_ | MIN_START_DATE | MAX_END_DATE | VAL -----: | :--- | :------------- | :----------- | :-- 123 | k_1 | 01-JAN-21 | 03-JAN-21 | A 123 | k_1 | 04-JAN-21 | 06-JAN-21 | B 123 | k_1 | 07-JAN-21 | 09-JAN-21 | A 123 | k_1 | 11-JAN-21 | 14-JAN-21 | A
db<>fiddle here
Upvotes: 1
Reputation: 1269923
If you indeed have data every day, then you can use the following relatively simple logic. The subquery calculate when the value changes. The outer query then calculates the end date by looking at the date in the next row:
select obj_id, key, value_, start_dt,
coalesce(lead(start_dt) over (partition by obj_id, key order by start_dt) - interval '1' day, max_end_dt)
from (select ovh.*,
lag(value_) over (partition by obj_id, key order by start_dt) as prev_value_,
max(end_dt) over (partition by obj_id, key) as max_end_dt
from obj_val_hist ovh
where obj_id = 123
) ovh
where prev_value_ is null or prev_value_ <> value_;
However, your data suggests that you could have a much more complicated problem. You have two dates in the row, a start date and end date. These could, in theory, overlap or have gaps. You can handle that by assigning groups when a new key/value pair starts and then aggregating:
select obj_id, key, value_, min(start_dt), max(end_dt)
from (select ovh.*,
sum(case when prev_end_dt >= start_dt - interval '1' day then 0 else 1 end) over (partition by obj_id, key order by start_dt) as grp
from (select ovh.*,
max(end_dt) over (partition by obj_id, key, value_
order by start_dt
range between unbounded preceding and interval '1' day preceding
) as prev_end_dt
from obj_val_hist ovh
) ovh
) ovh
group by obj_id, key, value_, grp;
Here is a db<>fiddle.
Upvotes: 1