Reputation: 49590
Given each row that represents a task, with start time and end time, how can I calculate the number of running tasks (i.e. that started and not ended) at the time each task starts (including itself) using a window function with COUNT OVER
? Is a window function even the right approach?
Example, given table tasks
:
task_id start_time end_time
a 1 10
b 2 5
c 5 15
d 8 13
e 12 20
f 21 30
Calculate running_tasks
:
task_id start_time end_time running_tasks
a 1 10 1 # a
b 2 5 2 # a,b
c 5 15 2 # a,c (b has ended)
d 8 13 3 # a,c,d
e 12 20 3 # c,d,e (a has ended)
f 21 30 1 # f (c,d,e have ended)
Upvotes: 3
Views: 152
Reputation: 173013
As Elliott has mentioned - "it's generally more difficult to explain analytic functions to new users" and even established users do not always 100% good at it (while being very very close to it)!
So, while Dudu Markovitz answer is great - unfortunatelly, it is still incorrect (at least as per how I understood question). The case when it is not correct is when you have multiple tasks started at the same start_time - so those tasks have wrong "running tasks" result
As an example - consider below example:
task_id start_time end_time
a 1 10
aa 1 2
aaa 1 8
b 2 5
c 5 15
d 8 13
e 12 20
f 21 30
I think, you would expect below result:
task_id start_time end_time running_tasks
a 1 10 3 # a,aa,aaa
aa 1 2 3 # a,aa,aaa
aaa 1 8 3 # a,aa,aaa
b 2 5 3 # a,aaa,b (aa has ended)
c 5 15 3 # a,aaa,c (b has ended)
d 8 13 3 # a,c,d (aaa has ended)
e 12 20 3 # c,d,e (a has ended)
f 21 30 1 # f (c,d,e have ended)
If you will try it with Dudu's code - you will get below instead
task_id start_time end_time running_tasks
a 1 10 1
aa 1 2 2
aaa 1 8 3
b 2 5 3
c 5 15 3
d 8 13 3
e 12 20 3
f 21 30 1
As you can see result for tasks a and aa wrong.
The reason is because of use of ROWS UNBOUNDED PRECEDING
instead of RANGE UNBOUNDED PRECEDING
- small but very important nuance!
So below query will give you correct result
SELECT task_id,start_time,end_time,running_tasks
FROM (
SELECT
task_id, tm, op, start_time, end_time,
SUM(op) OVER (ORDER BY tm ,op RANGE UNBOUNDED PRECEDING) AS running_tasks
FROM (
SELECT
task_id, start_time AS tm, 1 AS op, start_time, end_time
FROM tasks UNION ALL
SELECT
task_id, end_time AS tm, -1 AS op, start_time, end_time
FROM tasks
) t
)t
WHERE op = 1
ORDER BY start_time
quick summary:
ROWS UNBOUNDED PRECEDING - sets the window frame based on rows' position
whereas
RANGE UNBOUNDED PRECEDING - sets the window frame based on rows values
Again - as Elliott has mentioned - this is much more complex to fully get into it than JOIN concept - but it worth it (as it is much more efficient than joins) - see more about Window Frame Clause and ROWS vs RANGE use
Upvotes: 2
Reputation: 33745
You can use a correlated subquery, which in this case is a self-join; no analytic functions are needed. After enabling standard SQL (uncheck "Use Legacy SQL" under "Show Options" in the UI) you can run this example:
WITH tasks AS (
SELECT
task_id,
start_time,
end_time
FROM UNNEST(ARRAY<STRUCT<task_id STRING, start_time INT64, end_time INT64>>[
('a', 1, 10),
('b', 2, 5),
('c', 5, 15),
('d', 8, 13),
('e', 12, 20),
('f', 21, 30)
])
)
SELECT
*,
(SELECT COUNT(*) FROM tasks t2
WHERE t.start_time >= t2.start_time AND
t.start_time < t2.end_time) AS running_tasks
FROM tasks t
ORDER BY task_id;
Upvotes: 2
Reputation: 44951
select task_id,start_time,end_time,running_tasks
from (select task_id,tm,op,start_time,end_time
,sum(op) over
(
order by tm,op
rows unbounded preceding
) as running_tasks
from (select task_id,start_time as tm,1 as op,start_time,end_time
from tasks
union all
select task_id,end_time as tm,-1 as op,start_time,end_time
from tasks
) t
)t
where op = 1
;
Upvotes: 2