Reputation: 1

Group a series of events into one line with a start and end time

This question regards SQL (specifically using Presto).

I have a situation where agents are interacting with tasks (commenting, taking ownership, adding tags, etc.) and I'm trying to group each series of interactions into one line with a summarized start and end time.

Basically, the table currently looks something like this:

Task Name	Agent Name	Event Type	Time
Computer Repair Help	Dave	Claim	2:35PM
Computer Repair Help	Dave	Comment	2:37PM
Computer Repair Help	Dave	Tag	2:39PM
Computer Repair Help	Dave	Pass	2:45PM
Computer Repair Help	Susie	Claim	2:56PM
Computer Repair Help	Susie	Tag	2:58PM
Computer Repair Help	Susie	Pass	3:02PM
Computer Repair Help	Dave	Claim	3:05PM
Computer Repair Help	Dave	Comment	3:08PM
Computer Repair Help	Dave	Comment	3:11PM
Computer Repair Help	Dave	Close	3:16PM

But I am hoping to get something like this:

Task Name	Agent Name	Start Time	End Time
Computer Repair Help	Dave	2:35PM	2:45PM
Computer Repair Help	Susie	2:56PM	3:02PM
Computer Repair Help	Dave	3:05PM	3:16PM

I've tried partitioning by task name and agent name to get the first and last row of each interaction, but the main problem is that if an agent has two interaction windows, it will give me the total range of all their interactions instead of separating like in the example. Not sure how to tell the query to treat it like a new interaction after every agent name change.

Any help would be greatly appreciated... have been bashing my head against this for weeks and it seems like there must be some simple solution I'm not seeing. Thanks!

Upvotes: 0

Answers (3)

Guru Stron

Reputation: 143003

This is a variant of gaps-and-islands problem. Using lag() you can find where the value changes and then do a cumulative sum to assign groups.

WITH dataset AS (
    SELECT * FROM (VALUES   
('Computer Repair Help' ,'Dave','   Claim'  ,'2:35PM'),
('Computer Repair Help' ,'Dave','   Comment'    ,'2:37PM'),
('Computer Repair Help' ,'Dave','   Tag',   '2:39PM'),
('Computer Repair Help' ,'Dave','   Pass'   ,'2:45PM'),
('Computer Repair Help' ,'Susie','  Claim'  ,'2:56PM'),
('Computer Repair Help' ,'Susie','  Tag','2:58PM'),
('Computer Repair Help' ,'Susie','  Pass'   ,'3:02PM'),
('Computer Repair Help' ,'Dave  ','Claim'   ,'3:05PM'),
('Computer Repair Help' ,'Dave  ','Comment' ,'3:08PM'),
('Computer Repair Help' ,'Dave  ','Comment' ,'3:11PM'),
('Computer Repair Help' ,'Dave  ','Close'   ,'3:16PM')
                   
 ) AS t (task, name, event_type, time))


select task, name, min(time) as start_time, max(time) as end_time
from (
         select *,
                sum(case when name = prev_id then 0 else 1 end) over (partition by task, name order by time) as grp
         from (
                  select task, name, time,
                         lag(name) over (order by time) as prev_id
                  from dataset
              )
     )
group by task, name, grp

Output:

task	name	start_time	end_time
Computer Repair Help	Dave	2:35PM	2:45PM
Computer Repair Help	Susie	2:56PM	3:02PM
Computer Repair Help	Dave	3:05PM	3:16PM

Note that possibly you will need to parse your time into actual one.

Upvotes: 0

ggordon

Reputation: 10035

This can be solved as a "gaps and islands" problem. I assumed based on your dataset that each new set of interactions started with a Claim.

As such the CTE (which may also be rewritten as a subquery) used the SUM window function to create groups with the assistance of a case expression. The final projection grouped the data using this GroupNum (where RANGE UNBOUNDED PRECEDING is optional), Task and Event Type in order to find the StartTime as MIN(Time) and EndTime as MAX(Time) for each group.

You may try the following:

WITH CTE AS (
    SELECT
        *,
        SUM(
            CASE
                WHEN EventType='Claim' THEN 1
                ELSE 0
            END
        ) OVER (
            PARTITION BY TaskName, AgentName 
            ORDER BY Time 
            RANGE UNBOUNDED PRECEDING
        ) as GroupNum
    FROM
        my_table
)
SELECT
    TaskName,
    AgentName,
    MIN(Time) as StartTime,
    MAX(Time) as EndTime
FROM
    CTE
GROUP BY
    TaskName,
    AgentName,
    GroupNum

TaskName	AgentName	StartTime	EndTime
Computer Repair Help	Dave	2:35PM	2:45PM
Computer Repair Help	Susie	2:56PM	3:02PM
Computer Repair Help	Dave	3:05PM	3:16PM

View on DB Fiddle

Let me know if this works for you.

Upvotes: 0

Martin Traverso

Reputation: 5316

You can solve this using MATCH_RECOGNIZE, which is available in recent versions of Trino (formerly known as Presto SQL):

WITH data(task, agent, event, ts) AS (
    VALUES ('Computer Repair Help', 'Dave', 'Claim', '2:35PM'),
           ('Computer Repair Help', 'Dave', 'Comment', '2:37PM'),
           ('Computer Repair Help', 'Dave', 'Tag', '2:39PM'),
           ('Computer Repair Help', 'Dave', 'Pass', '2:45PM'),
           ('Computer Repair Help', 'Susie', 'Claim', '2:56PM'),
           ('Computer Repair Help', 'Susie', 'Tag', '2:58PM'),
           ('Computer Repair Help', 'Susie', 'Pass', '3:02PM'),
           ('Computer Repair Help', 'Dave', 'Claim', '3:05PM'),
           ('Computer Repair Help', 'Dave', 'Comment', '3:08PM'),
           ('Computer Repair Help', 'Dave', 'Comment', '3:11PM'),
           ('Computer Repair Help', 'Dave', 'Close', '3:16PM')
)
SELECT task, agent, start_time, end_time
FROM data MATCH_RECOGNIZE (
        PARTITION BY task
        ORDER BY ts
        MEASURES
            FIRST(M.agent) AS agent,
            FIRST(M.ts) AS start_time,
            LAST(M.ts) AS end_time
        ONE ROW PER MATCH
        AFTER MATCH SKIP PAST LAST ROW
        PATTERN (M+)
        DEFINE M AS agent = FIRST(agent)
    )

         task         | agent | start_time | end_time
----------------------+-------+------------+----------
 Computer Repair Help | Dave  | 2:35PM     | 2:45PM
 Computer Repair Help | Susie | 2:56PM     | 3:02PM
 Computer Repair Help | Dave  | 3:05PM     | 3:16PM
(3 rows)

The basic idea is to:

Partition the dataset by task and order it by time
Match sequences of rows whose agent is equal to the agent of the first row in the sequence
Extract the time of the first and last events in the sequence

You can find more details about MATCH_RECOGNIZE here:

Documentation: https://trino.io/docs/current/sql/match-recognize.html
Blog post: https://trino.io/blog/2021/05/19/row_pattern_matching.html
Podcast: https://trino.io/episodes/23.html

Upvotes: 2

Group a series of events into one line with a start and end time

Answers (3)

Related Questions