HelasMMB
HelasMMB

Reputation: 1

Group a series of events into one line with a start and end time

This question regards SQL (specifically using Presto).

I have a situation where agents are interacting with tasks (commenting, taking ownership, adding tags, etc.) and I'm trying to group each series of interactions into one line with a summarized start and end time.

Basically, the table currently looks something like this:

Task Name Agent Name Event Type Time
Computer Repair Help Dave Claim 2:35PM
Computer Repair Help Dave Comment 2:37PM
Computer Repair Help Dave Tag 2:39PM
Computer Repair Help Dave Pass 2:45PM
Computer Repair Help Susie Claim 2:56PM
Computer Repair Help Susie Tag 2:58PM
Computer Repair Help Susie Pass 3:02PM
Computer Repair Help Dave Claim 3:05PM
Computer Repair Help Dave Comment 3:08PM
Computer Repair Help Dave Comment 3:11PM
Computer Repair Help Dave Close 3:16PM

But I am hoping to get something like this:

Task Name Agent Name Start Time End Time
Computer Repair Help Dave 2:35PM 2:45PM
Computer Repair Help Susie 2:56PM 3:02PM
Computer Repair Help Dave 3:05PM 3:16PM

I've tried partitioning by task name and agent name to get the first and last row of each interaction, but the main problem is that if an agent has two interaction windows, it will give me the total range of all their interactions instead of separating like in the example. Not sure how to tell the query to treat it like a new interaction after every agent name change.

Any help would be greatly appreciated... have been bashing my head against this for weeks and it seems like there must be some simple solution I'm not seeing. Thanks!

Upvotes: 0

Views: 497

Answers (3)

Guru Stron
Guru Stron

Reputation: 143003

This is a variant of gaps-and-islands problem. Using lag() you can find where the value changes and then do a cumulative sum to assign groups.

WITH dataset AS (
    SELECT * FROM (VALUES   
('Computer Repair Help' ,'Dave','   Claim'  ,'2:35PM'),
('Computer Repair Help' ,'Dave','   Comment'    ,'2:37PM'),
('Computer Repair Help' ,'Dave','   Tag',   '2:39PM'),
('Computer Repair Help' ,'Dave','   Pass'   ,'2:45PM'),
('Computer Repair Help' ,'Susie','  Claim'  ,'2:56PM'),
('Computer Repair Help' ,'Susie','  Tag','2:58PM'),
('Computer Repair Help' ,'Susie','  Pass'   ,'3:02PM'),
('Computer Repair Help' ,'Dave  ','Claim'   ,'3:05PM'),
('Computer Repair Help' ,'Dave  ','Comment' ,'3:08PM'),
('Computer Repair Help' ,'Dave  ','Comment' ,'3:11PM'),
('Computer Repair Help' ,'Dave  ','Close'   ,'3:16PM')
                   
 ) AS t (task, name, event_type, time))


select task, name, min(time) as start_time, max(time) as end_time
from (
         select *,
                sum(case when name = prev_id then 0 else 1 end) over (partition by task, name order by time) as grp
         from (
                  select task, name, time,
                         lag(name) over (order by time) as prev_id
                  from dataset
              )
     )
group by task, name, grp

Output:

task name start_time end_time
Computer Repair Help Dave 2:35PM 2:45PM
Computer Repair Help Susie 2:56PM 3:02PM
Computer Repair Help Dave 3:05PM 3:16PM

Note that possibly you will need to parse your time into actual one.

Upvotes: 0

ggordon
ggordon

Reputation: 10035

This can be solved as a "gaps and islands" problem. I assumed based on your dataset that each new set of interactions started with a Claim.

As such the CTE (which may also be rewritten as a subquery) used the SUM window function to create groups with the assistance of a case expression. The final projection grouped the data using this GroupNum (where RANGE UNBOUNDED PRECEDING is optional), Task and Event Type in order to find the StartTime as MIN(Time) and EndTime as MAX(Time) for each group.

You may try the following:

WITH CTE AS (
    SELECT
        *,
        SUM(
            CASE
                WHEN EventType='Claim' THEN 1
                ELSE 0
            END
        ) OVER (
            PARTITION BY TaskName, AgentName 
            ORDER BY Time 
            RANGE UNBOUNDED PRECEDING
        ) as GroupNum
    FROM
        my_table
)
SELECT
    TaskName,
    AgentName,
    MIN(Time) as StartTime,
    MAX(Time) as EndTime
FROM
    CTE
GROUP BY
    TaskName,
    AgentName,
    GroupNum
TaskName AgentName StartTime EndTime
Computer Repair Help Dave 2:35PM 2:45PM
Computer Repair Help Susie 2:56PM 3:02PM
Computer Repair Help Dave 3:05PM 3:16PM

View on DB Fiddle

Let me know if this works for you.

Upvotes: 0

Martin Traverso
Martin Traverso

Reputation: 5316

You can solve this using MATCH_RECOGNIZE, which is available in recent versions of Trino (formerly known as Presto SQL):

WITH data(task, agent, event, ts) AS (
    VALUES ('Computer Repair Help', 'Dave', 'Claim', '2:35PM'),
           ('Computer Repair Help', 'Dave', 'Comment', '2:37PM'),
           ('Computer Repair Help', 'Dave', 'Tag', '2:39PM'),
           ('Computer Repair Help', 'Dave', 'Pass', '2:45PM'),
           ('Computer Repair Help', 'Susie', 'Claim', '2:56PM'),
           ('Computer Repair Help', 'Susie', 'Tag', '2:58PM'),
           ('Computer Repair Help', 'Susie', 'Pass', '3:02PM'),
           ('Computer Repair Help', 'Dave', 'Claim', '3:05PM'),
           ('Computer Repair Help', 'Dave', 'Comment', '3:08PM'),
           ('Computer Repair Help', 'Dave', 'Comment', '3:11PM'),
           ('Computer Repair Help', 'Dave', 'Close', '3:16PM')
)
SELECT task, agent, start_time, end_time
FROM data MATCH_RECOGNIZE (
        PARTITION BY task
        ORDER BY ts
        MEASURES
            FIRST(M.agent) AS agent,
            FIRST(M.ts) AS start_time,
            LAST(M.ts) AS end_time
        ONE ROW PER MATCH
        AFTER MATCH SKIP PAST LAST ROW
        PATTERN (M+)
        DEFINE M AS agent = FIRST(agent)
    )

=>

         task         | agent | start_time | end_time
----------------------+-------+------------+----------
 Computer Repair Help | Dave  | 2:35PM     | 2:45PM
 Computer Repair Help | Susie | 2:56PM     | 3:02PM
 Computer Repair Help | Dave  | 3:05PM     | 3:16PM
(3 rows)

The basic idea is to:

  • Partition the dataset by task and order it by time
  • Match sequences of rows whose agent is equal to the agent of the first row in the sequence
  • Extract the time of the first and last events in the sequence

You can find more details about MATCH_RECOGNIZE here:

Upvotes: 2

Related Questions