Reputation: 1
This question regards SQL (specifically using Presto).
I have a situation where agents are interacting with tasks (commenting, taking ownership, adding tags, etc.) and I'm trying to group each series of interactions into one line with a summarized start and end time.
Basically, the table currently looks something like this:
Task Name | Agent Name | Event Type | Time |
---|---|---|---|
Computer Repair Help | Dave | Claim | 2:35PM |
Computer Repair Help | Dave | Comment | 2:37PM |
Computer Repair Help | Dave | Tag | 2:39PM |
Computer Repair Help | Dave | Pass | 2:45PM |
Computer Repair Help | Susie | Claim | 2:56PM |
Computer Repair Help | Susie | Tag | 2:58PM |
Computer Repair Help | Susie | Pass | 3:02PM |
Computer Repair Help | Dave | Claim | 3:05PM |
Computer Repair Help | Dave | Comment | 3:08PM |
Computer Repair Help | Dave | Comment | 3:11PM |
Computer Repair Help | Dave | Close | 3:16PM |
But I am hoping to get something like this:
Task Name | Agent Name | Start Time | End Time |
---|---|---|---|
Computer Repair Help | Dave | 2:35PM | 2:45PM |
Computer Repair Help | Susie | 2:56PM | 3:02PM |
Computer Repair Help | Dave | 3:05PM | 3:16PM |
I've tried partitioning by task name and agent name to get the first and last row of each interaction, but the main problem is that if an agent has two interaction windows, it will give me the total range of all their interactions instead of separating like in the example. Not sure how to tell the query to treat it like a new interaction after every agent name change.
Any help would be greatly appreciated... have been bashing my head against this for weeks and it seems like there must be some simple solution I'm not seeing. Thanks!
Upvotes: 0
Views: 497
Reputation: 143003
This is a variant of gaps-and-islands problem. Using lag()
you can find where the value changes and then do a cumulative sum to assign groups.
WITH dataset AS (
SELECT * FROM (VALUES
('Computer Repair Help' ,'Dave',' Claim' ,'2:35PM'),
('Computer Repair Help' ,'Dave',' Comment' ,'2:37PM'),
('Computer Repair Help' ,'Dave',' Tag', '2:39PM'),
('Computer Repair Help' ,'Dave',' Pass' ,'2:45PM'),
('Computer Repair Help' ,'Susie',' Claim' ,'2:56PM'),
('Computer Repair Help' ,'Susie',' Tag','2:58PM'),
('Computer Repair Help' ,'Susie',' Pass' ,'3:02PM'),
('Computer Repair Help' ,'Dave ','Claim' ,'3:05PM'),
('Computer Repair Help' ,'Dave ','Comment' ,'3:08PM'),
('Computer Repair Help' ,'Dave ','Comment' ,'3:11PM'),
('Computer Repair Help' ,'Dave ','Close' ,'3:16PM')
) AS t (task, name, event_type, time))
select task, name, min(time) as start_time, max(time) as end_time
from (
select *,
sum(case when name = prev_id then 0 else 1 end) over (partition by task, name order by time) as grp
from (
select task, name, time,
lag(name) over (order by time) as prev_id
from dataset
)
)
group by task, name, grp
Output:
task | name | start_time | end_time |
---|---|---|---|
Computer Repair Help | Dave | 2:35PM | 2:45PM |
Computer Repair Help | Susie | 2:56PM | 3:02PM |
Computer Repair Help | Dave | 3:05PM | 3:16PM |
Note that possibly you will need to parse your time into actual one.
Upvotes: 0
Reputation: 10035
This can be solved as a "gaps and islands" problem. I assumed based on your dataset that each new set of interactions started with a Claim
.
As such the CTE
(which may also be rewritten as a subquery) used the SUM
window function to create groups with the assistance of a case expression. The final projection grouped the data using this GroupNum
(where RANGE UNBOUNDED PRECEDING
is optional), Task and Event Type in order to find the StartTime
as MIN(Time)
and EndTime
as MAX(Time)
for each group.
You may try the following:
WITH CTE AS (
SELECT
*,
SUM(
CASE
WHEN EventType='Claim' THEN 1
ELSE 0
END
) OVER (
PARTITION BY TaskName, AgentName
ORDER BY Time
RANGE UNBOUNDED PRECEDING
) as GroupNum
FROM
my_table
)
SELECT
TaskName,
AgentName,
MIN(Time) as StartTime,
MAX(Time) as EndTime
FROM
CTE
GROUP BY
TaskName,
AgentName,
GroupNum
TaskName | AgentName | StartTime | EndTime |
---|---|---|---|
Computer Repair Help | Dave | 2:35PM | 2:45PM |
Computer Repair Help | Susie | 2:56PM | 3:02PM |
Computer Repair Help | Dave | 3:05PM | 3:16PM |
Let me know if this works for you.
Upvotes: 0
Reputation: 5316
You can solve this using MATCH_RECOGNIZE
, which is available in recent versions of Trino (formerly known as Presto SQL):
WITH data(task, agent, event, ts) AS (
VALUES ('Computer Repair Help', 'Dave', 'Claim', '2:35PM'),
('Computer Repair Help', 'Dave', 'Comment', '2:37PM'),
('Computer Repair Help', 'Dave', 'Tag', '2:39PM'),
('Computer Repair Help', 'Dave', 'Pass', '2:45PM'),
('Computer Repair Help', 'Susie', 'Claim', '2:56PM'),
('Computer Repair Help', 'Susie', 'Tag', '2:58PM'),
('Computer Repair Help', 'Susie', 'Pass', '3:02PM'),
('Computer Repair Help', 'Dave', 'Claim', '3:05PM'),
('Computer Repair Help', 'Dave', 'Comment', '3:08PM'),
('Computer Repair Help', 'Dave', 'Comment', '3:11PM'),
('Computer Repair Help', 'Dave', 'Close', '3:16PM')
)
SELECT task, agent, start_time, end_time
FROM data MATCH_RECOGNIZE (
PARTITION BY task
ORDER BY ts
MEASURES
FIRST(M.agent) AS agent,
FIRST(M.ts) AS start_time,
LAST(M.ts) AS end_time
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (M+)
DEFINE M AS agent = FIRST(agent)
)
=>
task | agent | start_time | end_time
----------------------+-------+------------+----------
Computer Repair Help | Dave | 2:35PM | 2:45PM
Computer Repair Help | Susie | 2:56PM | 3:02PM
Computer Repair Help | Dave | 3:05PM | 3:16PM
(3 rows)
The basic idea is to:
You can find more details about MATCH_RECOGNIZE here:
Upvotes: 2