Rriskit
Rriskit

Reputation: 565

Oracle window function - retrieve elements up to next occurence of value

Given a start date I have a long list of event-logs for many id's. For each id and starting point time-ordered sequences starting with an 'success'-entry are selected. I need to retrieve only id's with failure as next event and only transactions until 'success' (or up to the last entry if only failures follow). 'failure'-entries are sparse. The number of potentially following failures has no 'natural' boundary.

Simplified Input (date format dd.mm.yyyy):

id          timestamp            event
123         12.09.2019          success
123         13.09.2019          success
124         12.09.2019          success
124         15.09.2019          failure
124         16.09.2019          success
124         17.09.2019          success
124         18.09.2019          failure
126         12.09.2019          success
126         16.09.2019          failure
126         17.09.2019          failure
128         …           

Expected Output:
    124         12.09.2019          success
    124         15.09.2019          failure
    124         16.09.2019          success
    126         12.09.2019          success
    126         16.09.2019          failure
    126         17.09.2019          failure

123 is discarded because the next event is success. Everything after the first success for 124 is discarded. 126 never reaches the 'success'-stage, so everything is retrieved.

I can test via lag/lead if the next transaction is success and exclude those - but how to find the next 'success'-row (if it might not even exist?). I solved this problem easily in python counting the 'success'-entries per id-group but I generate way to much IO in transfering all the data. Is there a way to count the occurrences of success per id maybe in a partition over - clause and cut after the 2?

I access an oracle 11g via cx_oracle/python from a jupyter-notebook (i.e. I pass sql-statements to the db). Each day has about 50k ids and up to a double digit million transactions.

Upvotes: 1

Views: 282

Answers (3)

Rriskit
Rriskit

Reputation: 565

Based on Ponder Stibbons approach I formed two flags. The simple one is for the case that the next event already is success. In this case everything should be discarded, even if some failure follow afterward. The success_flag is trickier, I used the row_number to identify rows with the combination of 'failure' as current event and success in the following (lead) row. Taking the minimum of those rows yields the place 'critical_line' where I need to cut and discard the following rows. If no success is reached critical line is null. Performance is ok, less than 3 Minutes. As the final code differs significantly from Ponder Stibbons I chose to put it as an answer.

with
  test (id, timestamp, event) as
    -- sample data
    (select 123, '12.09.2019', 'success' from dual union all
     select 123, '13.09.2019', 'success' from dual union all
     select 123, '15.09.2019', 'failure' from dual union all
     select 124, '12.09.2019', 'success' from dual union all
     select 124, '15.09.2019', 'failure' from dual union all
     select 124, '16.09.2019', 'success' from dual union all
     select 124, '17.09.2019', 'success' from dual union all
     select 124, '18.09.2019', 'failure' from dual union all
     select 126, '12.09.2019', 'success' from dual union all
     select 126, '16.09.2019', 'failure' from dual union all
     select 126, '17.09.2019', 'failure' from dual
    )
select
    id, timestamp, event  
from
(
  select id, rn, timestamp, event,  
    count(case when rn=2 and event='success' then 1 end) over (partition by id) as flag_simple,
    success_flag,
    min(case when success_flag>0 then success_flag end) over (partition by id) as critical_line
  from
  (
    select id, rn, timestamp, event, success_flag
      from (
                select 
                    id, 
                    timestamp, 
                    event, 
                    row_number() over (partition by id order by timestamp) as rn,
                    case when event='failure' and 
                        lead(event,1,'x') over (partition by id order by timestamp)='success'
                        then 1+row_number() over (partition by id order by timestamp) else 0 end 
                        as success_flag
                from 
                    test
            )
    )
  )
  where 
    flag_simple=0
    and
    (rn<=critical_line or critical_line is NULL)

Upvotes: 0

Ponder Stibbons
Ponder Stibbons

Reputation: 14848

select id, timestamp, event
  from (
    select id, timestamp, event, 
           lag(event, 1, 'x') over (partition by id order by timestamp) lg_event,
           count(case event when 'failure' then 1 end) over (partition by id) cf 
      from t)
  where cf <> 0 and (event = 'failure' or lg_event <> event)

dbfiddle demo

Use analytical count to find ids which have failures. Show only rows with failures or these where event changes, consecutive successes are ignored.

Upvotes: 1

Littlefoot
Littlefoot

Reputation: 142993

This is ugly, but returns result you want:

SQL> with
  2  test (id, timestamp, event) as
  3    -- sample data
  4    (select 123, '12.09.2019', 'success' from dual union all
  5     select 123, '13.09.2019', 'success' from dual union all
  6     --
  7     select 124, '12.09.2019', 'success' from dual union all
  8     select 124, '15.09.2019', 'failure' from dual union all
  9     select 124, '16.09.2019', 'success' from dual union all
 10     select 124, '17.09.2019', 'success' from dual union all
 11     select 124, '18.09.2019', 'failure' from dual union all
 12     --
 13     select 126, '12.09.2019', 'success' from dual union all
 14     select 126, '16.09.2019', 'failure' from dual union all
 15     select 126, '17.09.2019', 'failure' from dual
 16    ),
 17  valids_both as
 18    -- IDs have to have both success and failure events to be valid
 19    -- (eliminates 123)
 20    (select id
 21     from test
 22     group by id
 23     having count(distinct event) = 2
 24    ),
 25  valids_succ as
 26    -- search for timestamp of success which is not the starting success
 27    (select t.id, min(t.timestamp) timestamp
 28     from test t join valids_both v on v.id = t.id
 29     where t.event = 'success'
 30       and t.timestamp > (select min(t1.timestamp) From test t1
 31                          where t1.id = t.id
 32                         )
 33     group by t.id
 34    )
 35  -- this is ID = 124
 36  select t.id, t.timestamp, t.event
 37    from test t join valids_succ v on v.id = t.id
 38      and t.timestamp <= v.timestamp
 39  union
 40  -- this is ID = 126
 41  select t.id, t.timestamp, t.event
 42    from test t join valids_both v on v.id = t.id
 43    where not exists (select null from valids_succ v1
 44                      where v1.id = v.id
 45                     )
 46  order by id, timestamp;

        ID TIMESTAMP  EVENT
---------- ---------- -------
       124 12.09.2019 success
       124 15.09.2019 failure
       124 16.09.2019 success
       126 12.09.2019 success
       126 16.09.2019 failure
       126 17.09.2019 failure

6 rows selected.

SQL>

How would it work for a lot of data? I'm afraid to ask (once you test it).

Upvotes: 1

Related Questions