Reputation: 565
Given a start date, I have a long list of event logs for many IDs. For each ID and starting point, time-ordered sequences starting with a 'success' entry are selected. I need to retrieve only IDs whose next event is a failure, and only the transactions up to the next 'success' (or up to the last entry if only failures follow). 'failure' entries are sparse, and the number of failures that may follow has no natural boundary.
Simplified Input (date format dd.mm.yyyy):
id timestamp event
123 12.09.2019 success
123 13.09.2019 success
124 12.09.2019 success
124 15.09.2019 failure
124 16.09.2019 success
124 17.09.2019 success
124 18.09.2019 failure
126 12.09.2019 success
126 16.09.2019 failure
126 17.09.2019 failure
128 …
Expected Output:
124 12.09.2019 success
124 15.09.2019 failure
124 16.09.2019 success
126 12.09.2019 success
126 16.09.2019 failure
126 17.09.2019 failure
123 is discarded because its next event is again a success. For 124, everything after the first success following the failures (16.09.2019) is discarded. 126 never reaches the 'success' stage again, so everything is retrieved.
I can test via lag/lead whether the next transaction is a success and exclude those, but how do I find the next 'success' row when it might not even exist? I solved this problem easily in Python by counting the 'success' entries per ID group, but I generate way too much I/O in transferring all the data. Is there a way to count the occurrences of 'success' per ID, maybe in a partition-over clause, and cut after the second?
I access an Oracle 11g database via cx_Oracle/Python from a Jupyter notebook (i.e. I pass SQL statements to the DB). Each day has about 50k IDs and up to a double-digit million transactions.
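A minimal sketch of that counting idea, assuming the event log lives in a table named t (hypothetical): a running analytic sum gives each row the number of 'success' entries seen so far, so everything after the second success can be cut. Discarding IDs whose very next event is already a success would still need a separate check, as the answers below show.
select id, timestamp, event
from (
    select id, timestamp, event,
           -- running number of 'success' rows per id up to the current row
           sum(case when event = 'success' then 1 else 0 end)
               over (partition by id order by timestamp) as succ_cnt
    from t  -- assumed source table
)
where succ_cnt <= 2  -- cut everything after the second success
order by id, timestamp;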
Upvotes: 1
Views: 282
Reputation: 565
Based on Ponder Stibbons' approach I formed two flags. The simple one covers the case that the next event is already a success; then everything should be discarded, even if some failures follow afterwards. The success_flag is trickier: I used row_number to identify rows that combine 'failure' as the current event with 'success' in the following (lead) row. Taking the minimum of those row numbers yields the place, critical_line, where I need to cut and discard the following rows. If no success is reached, critical_line is NULL. For example, for ID 124 the failure on 15.09.2019 has rn = 2 and is followed by a success, so success_flag = 3; the minimum over the partition makes critical_line = 3, and everything after row 3 is cut. Performance is OK, less than 3 minutes. As the final code differs significantly from Ponder Stibbons', I chose to post it as an answer.
with
test (id, timestamp, event) as
-- sample data
(select 123, '12.09.2019', 'success' from dual union all
 select 123, '13.09.2019', 'success' from dual union all
 select 123, '15.09.2019', 'failure' from dual union all
 select 124, '12.09.2019', 'success' from dual union all
 select 124, '15.09.2019', 'failure' from dual union all
 select 124, '16.09.2019', 'success' from dual union all
 select 124, '17.09.2019', 'success' from dual union all
 select 124, '18.09.2019', 'failure' from dual union all
 select 126, '12.09.2019', 'success' from dual union all
 select 126, '16.09.2019', 'failure' from dual union all
 select 126, '17.09.2019', 'failure' from dual
)
select id, timestamp, event
from
(
    select id, rn, timestamp, event,
           -- flag_simple > 0: the second event of the id is already a success,
           -- so the whole id has to be discarded
           count(case when rn = 2 and event = 'success' then 1 end)
               over (partition by id) as flag_simple,
           -- critical_line: smallest row number of a success that directly
           -- follows a failure; NULL if success is never reached again
           min(case when success_flag > 0 then success_flag end)
               over (partition by id) as critical_line
    from
    (
        select id, timestamp, event,
               row_number() over (partition by id order by timestamp) as rn,
               -- a 'failure' row whose next row is a 'success' carries the
               -- row number of that following success row, otherwise 0
               case when event = 'failure' and
                         lead(event, 1, 'x') over (partition by id order by timestamp) = 'success'
                    then 1 + row_number() over (partition by id order by timestamp)
                    else 0
               end as success_flag
        from test
    )
)
where flag_simple = 0
  and (rn <= critical_line or critical_line is null)
Upvotes: 0
Reputation: 14848
select id, timestamp, event
from (
    select id, timestamp, event,
           -- previous event of the id ('x' for the first row)
           lag(event, 1, 'x') over (partition by id order by timestamp) lg_event,
           -- total number of 'failure' rows of the id
           count(case event when 'failure' then 1 end) over (partition by id) cf
    from t)
where cf <> 0 and (event = 'failure' or lg_event <> event)
Use the analytic count to find IDs which have failures. Show only rows with failures or those where the event changes; consecutive successes are ignored.
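To try this against the sample data from the question, those rows can be supplied as a CTE in place of the real source table t:
with t (id, timestamp, event) as
(select 123, '12.09.2019', 'success' from dual union all
 select 123, '13.09.2019', 'success' from dual union all
 select 124, '12.09.2019', 'success' from dual union all
 select 124, '15.09.2019', 'failure' from dual union all
 select 124, '16.09.2019', 'success' from dual union all
 select 124, '17.09.2019', 'success' from dual union all
 select 124, '18.09.2019', 'failure' from dual union all
 select 126, '12.09.2019', 'success' from dual union all
 select 126, '16.09.2019', 'failure' from dual union all
 select 126, '17.09.2019', 'failure' from dual
)
select id, timestamp, event
from (
    select id, timestamp, event,
           lag(event, 1, 'x') over (partition by id order by timestamp) lg_event,
           count(case event when 'failure' then 1 end) over (partition by id) cf
    from t)
where cf <> 0 and (event = 'failure' or lg_event <> event)
order by id, timestamp;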
Upvotes: 1
Reputation: 142993
This is ugly, but it returns the result you want:
with
test (id, timestamp, event) as
-- sample data
(select 123, '12.09.2019', 'success' from dual union all
 select 123, '13.09.2019', 'success' from dual union all
 --
 select 124, '12.09.2019', 'success' from dual union all
 select 124, '15.09.2019', 'failure' from dual union all
 select 124, '16.09.2019', 'success' from dual union all
 select 124, '17.09.2019', 'success' from dual union all
 select 124, '18.09.2019', 'failure' from dual union all
 --
 select 126, '12.09.2019', 'success' from dual union all
 select 126, '16.09.2019', 'failure' from dual union all
 select 126, '17.09.2019', 'failure' from dual
),
valids_both as
-- IDs have to have both success and failure events to be valid
-- (eliminates 123)
(select id
 from test
 group by id
 having count(distinct event) = 2
),
valids_succ as
-- search for the timestamp of a success which is not the starting success
(select t.id, min(t.timestamp) timestamp
 from test t join valids_both v on v.id = t.id
 where t.event = 'success'
   and t.timestamp > (select min(t1.timestamp) from test t1
                      where t1.id = t.id
                     )
 group by t.id
)
-- this is ID = 124
select t.id, t.timestamp, t.event
from test t join valids_succ v on v.id = t.id
                              and t.timestamp <= v.timestamp
union
-- this is ID = 126
select t.id, t.timestamp, t.event
from test t join valids_both v on v.id = t.id
where not exists (select null from valids_succ v1
                  where v1.id = v.id
                 )
order by id, timestamp;

        ID TIMESTAMP  EVENT
---------- ---------- -------
       124 12.09.2019 success
       124 15.09.2019 failure
       124 16.09.2019 success
       126 12.09.2019 success
       126 16.09.2019 failure
       126 17.09.2019 failure

6 rows selected.
How would it work for a lot of data? I'm afraid to ask (once you test it).
Upvotes: 1