Aki Kanerva
Aki Kanerva

Reputation: 73

T-SQL: Count number of failures until first success

I have a DB consisting of events with timestamps:

row eventName taskName timestamp userName
1   fail      ABC      10.5      John
2   fail      ABC      18.0      John
3   fail      ABC      19.0      Mike
4   fail      XYZ      21.0      John
5   fail      XYZ      23.0      Mike
6   success   ABC      25.0      John
7   fail      ABC      26.0      John
8   success   ABC      28.0      John

I'd like to count the number of failures until the first success, per user (and average, but that's beyond this question).

In the example above, John attempted task ABC 2 times (rows 1 & 2) until succeeding (row 6). Subsequent failures and successes can be ignored.

I think I could achieve this by counting the number of rows with "ABC" and "fail" whose timestamp is earlier than the earliest timestamp in all rows with "ABC" and "success", grouping by userName. How do I express this in T-SQL? Specifically, Vertica.

This seemed to be very similar to the case here: sql count/sum the number of calls until a specific date in another column

But when I tried to adapt the code from https://stackoverflow.com/a/39594686/4354459 as below, I think I got something wrong because I keep getting larger counts than expected.

WITH
Successes
AS
(
    SELECT
        events.userName
        ,events.taskName
        ,MIN(events.timestamp) AS FirstSuccessTime
    FROM events
    WHERE events.eventName = 'success'
    GROUP BY events.userName, events.taskName
)
SELECT
    events.userName
    ,events.taskName
    ,COUNT(events.eventName) AS FailuresUntilFirstSuccess
FROM
    Successes
    LEFT JOIN events
        ON  events.taskName = Successes.taskName
        AND events.timestamp < Successes.FirstSuccessTime
        AND events.eventName = 'fail'
GROUP BY events.userName, events.taskName
;

Upvotes: 0

Views: 2456

Answers (6)

Artem
Artem

Reputation: 2314

One more solutions using CTEs. The idea is to receive min row with success first. After that we can filter out other rows using simple condition, like row < min per user:

DECLARE @events TABLE([row] INT,eventName VARCHAR(50),taskName VARCHAR(50),userName VARCHAR(50))

INSERT INTO @events
        VALUES(1,'fail','ABC','John')
       ,(2,'fail','ABC','John')
       ,(3,'fail','ABC','Mike')
       ,(4,'fail','XYZ','John')
       ,(5,'fail','XYZ','Mike')
       ,(6,'success','ABC','John')
       ,(7,'fail','ABC','John')
       ,(8,'success','ABC','John')
       ,(9,'success','ABC','Mike')


SELECT * FROM @events;

WITH
    cte
    AS (
            SELECT userName
                    ,taskName
                  ,MIN(row) AS [min]
            FROM @events
            WHERE eventName = 'success'
            GROUP BY userName,taskName)
SELECT e.userName
        ,e.taskName
      ,COUNT(1) AS attempts
FROM @events e
JOIN cte c ON e.userName = c.userName
WHERE e.row < c.[min]
GROUP BY e.userName, e.taskName

Upvotes: 2

Rodrick Chapman
Rodrick Chapman

Reputation: 5543

Solution

Based on the schema, this query will give you what you want:

with Failures as
(
    select * from Event where event_name = 'fail'
),

Q as
(
    select * from Event E
        outer apply
        (
            select count(*) cnt from Failures F
                where F.task_name = E.task_name and F.username = E.username and F.ts < E.ts
        ) F

    where E.event_name = 'success'
)

select * from
(
    select Q.*, 
    row_number() over (partition by event_name, task_name, username order by ts) o from Q
) K where K.o = 1

Testing with your data yields:

id event_name task_name  timestamp  username  cnt
-- ---------- ---------- ---------- --------- ---
6  success    ABC        25         John      2

But, I went a bit further and added another 'success' row for Mike

insert Event select 'success', 'XYZ', 29.0, 'Mike':

and get

 id event_name task_name  timestamp  username  cnt
 -- ---------- ---------- ---------- --------- ---
 6  success    ABC        25         John      2
 9  success    XYZ        29         Mike      1

as expected.

Explanation

The first CTE generates a set of failures. The second CTE is recursive, where the base case is the set of successes and the recursive case is a count (cardinality) of the set of failures that preceded a given success (with respect to a user and task name).

Finally, we use row_number with respect to partitioning on event_name, task_name and username, so that the first success for a given partition will be labeled as '1'. Then we just filter out all of the rows whose row_number is unequal to '1'.

Upvotes: 1

Shawn
Shawn

Reputation: 4786

There's probably an easier way to get here, but I'll try to look more later.

Test Data Setup

IF OBJECT_ID(N'tempdb..#taskevents', N'U') IS NOT NULL   
DROP TABLE #taskevents;  
GO  

CREATE TABLE #taskevents ( 
      eventName varchar(10)
    , taskName varchar(10)
    , ts decimal(3,1)
    , userName varchar(10)
) ;

INSERT INTO #taskevents ( eventName, taskName, ts, userName )
VALUES 
      ('fail','ABC','10.5','John')
    , ('fail','ABC','10.6','John')
    , ('fail','ABC','18.0','John')
    , ('fail','ABC','22.0','John')
    , ('fail','ABC','22.5','John')
    , ('success','ABC','25.0','John')

    , ('fail','ABC','26.0','John')
    , ('success','ABC','28.0','John')

    , ('fail','XYZ','10.7','John')
    , ('fail','XYZ','21.0','John')

    , ('fail','ABC','19.0','Mike')

    , ('fail','XYZ','23.0','Mike')
    , ('success','XYZ','28.5','Mike')

    , ('success','QVC','42.0','Mike')
;

Query Time

SELECT s3.userName, s3.taskName, AVG(s3.failCount) AS avgFailCount
FROM (
    SELECT s1.userName, s1.taskName, s1.ts, s1.PreviousTS, COALESCE(s2.failCount,0) AS failCount
    FROM (
        SELECT t1.userName, t1.taskName, t1.ts, LAG(t1.ts) OVER (PARTITION BY t1.userName, t1.taskName ORDER BY t1.ts) AS PreviousTS --ROW_NUMBER() OVER (PARTITION BY t1.userName ORDER BY t1.ts) AS rn
        FROM #taskevents t1
        WHERE t1.eventName = 'success'
    ) s1
    OUTER APPLY ( 
        SELECT t2.userName, t2.taskName,  COUNT(*) AS failCount 
        FROM #taskevents t2
        WHERE t2.eventName = 'fail'
            AND t2.userName = s1.userName
            AND t2.taskName = s1.taskName
            AND t2.ts < s1.ts  
            AND ( t2.ts >= s1.PreviousTS OR s1.PreviousTS IS NULL )
        GROUP BY t2.userName, t2.taskName
    ) s2
) s3
GROUP BY s3.userName, s3.taskName

That gives you the average failures per user.

Upvotes: 0

Evaldas Buinauskas
Evaldas Buinauskas

Reputation: 14097

Potential solution too:

SELECT *, COUNT(*) OVER (PARTITION BY T1.userName, T1.taskName)
FROM tbl AS T1
WHERE T1.row < (   SELECT MIN(row)
                   FROM tbl AS T2
                   WHERE T2.userName = T1.userName
                   AND T2.taskName = T1.taskName
                   AND T2.eventName = 'success');

Similar solution to Artem's, but slightly different. Subquery will find lowest row value for each user and task that succeeded and will filter row based on that.

Upvotes: 0

John Cappelletti
John Cappelletti

Reputation: 81990

Updated and Corrected - Missed userName

Perhaps with a Cross Apply and a WITH TIES clause.

The top subquery will return only the first "success" records partitioned by taskName, and userName. In this case, only row 6 would be returned.

Example

Select A.*
      ,B.*
 From (
        Select Top 1 with ties * 
         from YourTable
         Where [eventName]='success'
         Order By Row_Number() over (Partition By taskName,userName Order by [row])
      ) A
 Cross Apply (
                Select Cnt=count(*) 
                 From YourTable 
                 Where taskName=A.taskName and A.userName=userName and [row]<A.[Row] and eventName='fail'
             ) B

Returns

enter image description here

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1270583

In SQL Server or Vertica, you can use window functions to count the number of successes before each row. Then use that for aggregation:

select username,
       sum(case when success_cnt = 0 and eventName = 'failure' then 1 else 0 end) as numfailures_to_first_success
from (select e.*, 
             sum(case when e.eventName = 'Success' then 1 else 0 end) over (partition by e.username order by e.row) as success_cnt
      from events e
     ) e
group by username;

Note: If you want this per task, then include that in both the partition by and group by.

Upvotes: 2

Related Questions