Anton Zaviriukhin
Anton Zaviriukhin

Reputation: 741

Group historical data

I'm stuck with the following problem and need help:

An object has properties that are calculated every day.
They are stored in a key-value historical table.
Property is mistakenly stored even if it was not changed.

I need a query that will group this data set by "actual values":

Here is a dataset example.

with obj_val_hist as
(
select 123 obj_id, 'k_1' key, 'A' value_, to_date('01.01.2021', 'DD.MM.YYYY') start_dt, to_date('01.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'A' value_, to_date('02.01.2021', 'DD.MM.YYYY') start_dt, to_date('02.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'A' value_, to_date('03.01.2021', 'DD.MM.YYYY') start_dt, to_date('03.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'B' value_, to_date('04.01.2021', 'DD.MM.YYYY') start_dt, to_date('04.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'B' value_, to_date('05.01.2021', 'DD.MM.YYYY') start_dt, to_date('05.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'B' value_, to_date('06.01.2021', 'DD.MM.YYYY') start_dt, to_date('06.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'A' value_, to_date('07.01.2021', 'DD.MM.YYYY') start_dt, to_date('07.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'A' value_, to_date('08.01.2021', 'DD.MM.YYYY') start_dt, to_date('08.01.2021', 'DD.MM.YYYY') end_dt from dual union all
select 123 obj_id, 'k_1' key, 'A' value_, to_date('09.01.2021', 'DD.MM.YYYY') start_dt, to_date('09.01.2021', 'DD.MM.YYYY') end_dt from dual 
)
select * from obj_val_hist where obj_id = 123;

Data set:

obj_id key value start_date end_date
123 k_1 A 01.01.2021 01.01.2021
123 k_1 A 02.01.2021 02.01.2021
123 k_1 A 03.01.2021 03.01.2021
123 k_1 B 04.01.2021 04.01.2021
123 k_1 B 05.01.2021 05.01.2021
123 k_1 B 06.01.2021 06.01.2021
123 k_1 A 07.01.2021 07.01.2021
123 k_1 A 08.01.2021 08.01.2021
123 k_1 A 09.01.2021 09.01.2021

Expected result:

obj_id key value start_date end_date
123 k_1 A 01.01.2021 03.01.2021
123 k_1 B 04.01.2021 06.01.2021
123 k_1 A 07.01.2021 09.01.2021

This table contains values for million objects.
It is queried by obj_id and has an index on it.
Performance is a key point so using stored functions is most probably not an option.
This query will be a small part of a big view that is used by an external system.

I expected that there should be an analytic function suited for such a problem.
Something like dense_rank but with the possibility to order by one column (start_dt) but increase value when another column (value_) gets a different value. But I didn't find one.

Upvotes: 0

Views: 98

Answers (2)

astentx
astentx

Reputation: 6751

You may use match_recognize for this, which can also handle gaps in dates and is quite efficient and natural to read:

create table t (
  obj_id
  , key_
  , value_
  , start_date
  , end_date
)
as
select 123, 'k_1', 'A', to_date('01.01.2021', 'dd.mm.yyyy'), to_date('01.01.2021', 'dd.mm.yyyy') from dual union all
select 123, 'k_1', 'A', to_date('02.01.2021', 'dd.mm.yyyy'), to_date('02.01.2021', 'dd.mm.yyyy') from dual union all
select 123, 'k_1', 'A', to_date('03.01.2021', 'dd.mm.yyyy'), to_date('03.01.2021', 'dd.mm.yyyy') from dual union all
select 123, 'k_1', 'B', to_date('04.01.2021', 'dd.mm.yyyy'), to_date('04.01.2021', 'dd.mm.yyyy') from dual union all
select 123, 'k_1', 'B', to_date('05.01.2021', 'dd.mm.yyyy'), to_date('05.01.2021', 'dd.mm.yyyy') from dual union all
select 123, 'k_1', 'B', to_date('06.01.2021', 'dd.mm.yyyy'), to_date('06.01.2021', 'dd.mm.yyyy') from dual union all
select 123, 'k_1', 'A', to_date('07.01.2021', 'dd.mm.yyyy'), to_date('07.01.2021', 'dd.mm.yyyy') from dual union all
select 123, 'k_1', 'A', to_date('08.01.2021', 'dd.mm.yyyy'), to_date('08.01.2021', 'dd.mm.yyyy') from dual union all
select 123, 'k_1', 'A', to_date('09.01.2021', 'dd.mm.yyyy'), to_date('09.01.2021', 'dd.mm.yyyy') from dual union all
/*Let's skip 10.01*/
select 123, 'k_1', 'A', to_date('11.01.2021', 'dd.mm.yyyy'), to_date('11.01.2021', 'dd.mm.yyyy') from dual union all
/*And extent validity period for some record*/
select 123, 'k_1', 'A', to_date('12.01.2021', 'dd.mm.yyyy'), to_date('13.01.2021', 'dd.mm.yyyy') from dual union all
select 123, 'k_1', 'A', to_date('14.01.2021', 'dd.mm.yyyy'), to_date('14.01.2021', 'dd.mm.yyyy') from dual
select *
from t
match_recognize (
  /*For each ID and KEY*/
  partition by obj_id, key_
  
  order by start_date asc
  /*Output attributes*/
  measures
    /*start_date of the first row in match group*/
    final first(start_date) as min_start_date,
    /*end_date of the last row in match group*/
    final last(end_date) as max_end_date,
    /*value itself as it is constant for match group*/
    value_ as val
  /*First row and any consequtive matches*/
  pattern (init A*)
  define
    /*Consequtive are the rows which have the same value in value_ field
      and start_date of the next row is not farther than
      1 day from end_date of the previous row
    */
    A as prev(value_) = value_
      and prev(end_date) + 1 = start_date
)
OBJ_ID | KEY_ | MIN_START_DATE | MAX_END_DATE | VAL
-----: | :--- | :------------- | :----------- | :--
   123 | k_1  | 01-JAN-21      | 03-JAN-21    | A  
   123 | k_1  | 04-JAN-21      | 06-JAN-21    | B  
   123 | k_1  | 07-JAN-21      | 09-JAN-21    | A  
   123 | k_1  | 11-JAN-21      | 14-JAN-21    | A  

db<>fiddle here

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269923

If you indeed have data every day, then you can use the following relatively simple logic. The subquery calculate when the value changes. The outer query then calculates the end date by looking at the date in the next row:

select obj_id, key, value_, start_dt,
       coalesce(lead(start_dt) over (partition by obj_id, key order by start_dt) - interval '1' day, max_end_dt)
from (select ovh.*,
             lag(value_) over (partition by obj_id, key order by start_dt) as prev_value_,
             max(end_dt) over (partition by obj_id, key) as max_end_dt
      from obj_val_hist ovh
      where obj_id = 123
     ) ovh
where prev_value_ is null or prev_value_ <> value_;

However, your data suggests that you could have a much more complicated problem. You have two dates in the row, a start date and end date. These could, in theory, overlap or have gaps. You can handle that by assigning groups when a new key/value pair starts and then aggregating:

select obj_id, key, value_, min(start_dt), max(end_dt)
from (select ovh.*,
             sum(case when prev_end_dt >= start_dt - interval '1' day then 0 else 1 end) over (partition by obj_id, key order by start_dt) as grp
      from (select ovh.*,
                   max(end_dt) over (partition by obj_id, key, value_
                                     order by start_dt
                                     range between unbounded preceding and interval '1' day preceding
                                    ) as prev_end_dt
            from obj_val_hist ovh
           ) ovh
     ) ovh
group by obj_id, key, value_, grp;

Here is a db<>fiddle.

Upvotes: 1

Related Questions