New Dev
New Dev

Reputation: 49590

COUNT() OVER conditioned on the CURRENT ROW

Given each row that represents a task, with start time and end time, how can I calculate the number of running tasks (i.e. that started and not ended) at the time each task starts (including itself) using a window function with COUNT OVER? Is a window function even the right approach?

Example, given table tasks:

task_id  start_time  end_time
   a         1          10
   b         2           5
   c         5          15
   d         8          13
   e        12          20
   f        21          30

Calculate running_tasks:

task_id  start_time  end_time  running_tasks
   a         1          10           1         # a
   b         2           5           2         # a,b
   c         5          15           2         # a,c (b has ended)
   d         8          13           3         # a,c,d
   e        12          20           3         # c,d,e (a has ended)
   f        21          30           1         # f (c,d,e have ended)

Upvotes: 3

Views: 152

Answers (3)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173013

As Elliott has mentioned - "it's generally more difficult to explain analytic functions to new users" and even established users do not always 100% good at it (while being very very close to it)!
So, while Dudu Markovitz answer is great - unfortunatelly, it is still incorrect (at least as per how I understood question). The case when it is not correct is when you have multiple tasks started at the same start_time - so those tasks have wrong "running tasks" result

As an example - consider below example:

task_id  start_time  end_time
   a         1          10
   aa        1           2
   aaa       1           8
   b         2           5
   c         5          15
   d         8          13
   e        12          20
   f        21          30

I think, you would expect below result:

task_id  start_time  end_time  running_tasks
   a         1          10           3         # a,aa,aaa
   aa        1           2           3         # a,aa,aaa
   aaa       1           8           3         # a,aa,aaa
   b         2           5           3         # a,aaa,b (aa has ended)
   c         5          15           3         # a,aaa,c (b has ended)
   d         8          13           3         # a,c,d (aaa has ended)
   e        12          20           3         # c,d,e (a has ended)
   f        21          30           1         # f (c,d,e have ended)     

If you will try it with Dudu's code - you will get below instead

task_id  start_time  end_time  running_tasks
   a         1          10           1        
   aa        1           2           2        
   aaa       1           8           3        
   b         2           5           3        
   c         5          15           3        
   d         8          13           3        
   e        12          20           3        
   f        21          30           1        

As you can see result for tasks a and aa wrong.
The reason is because of use of ROWS UNBOUNDED PRECEDING instead of RANGE UNBOUNDED PRECEDING - small but very important nuance!

So below query will give you correct result

SELECT  task_id,start_time,end_time,running_tasks 
FROM  (
  SELECT  
    task_id, tm, op, start_time, end_time,
    SUM(op) OVER (ORDER BY  tm ,op RANGE UNBOUNDED PRECEDING) AS running_tasks 
  FROM  (
    SELECT  
      task_id, start_time AS tm, 1 AS op, start_time, end_time 
    FROM  tasks UNION  ALL 
    SELECT  
      task_id, end_time AS tm, -1 AS op, start_time, end_time 
    FROM  tasks 
  ) t 
)t 
WHERE  op = 1
ORDER BY start_time       

quick summary:
ROWS UNBOUNDED PRECEDING - sets the window frame based on rows' position
whereas
RANGE UNBOUNDED PRECEDING - sets the window frame based on rows values

Again - as Elliott has mentioned - this is much more complex to fully get into it than JOIN concept - but it worth it (as it is much more efficient than joins) - see more about Window Frame Clause and ROWS vs RANGE use

Upvotes: 2

Elliott Brossard
Elliott Brossard

Reputation: 33745

You can use a correlated subquery, which in this case is a self-join; no analytic functions are needed. After enabling standard SQL (uncheck "Use Legacy SQL" under "Show Options" in the UI) you can run this example:

WITH tasks AS (
  SELECT
    task_id,
    start_time,
    end_time
  FROM UNNEST(ARRAY<STRUCT<task_id STRING, start_time INT64, end_time INT64>>[
    ('a', 1, 10),
    ('b', 2, 5),
    ('c', 5, 15),
    ('d', 8, 13),
    ('e', 12, 20),
    ('f', 21, 30)
  ])
)
SELECT
  *,
  (SELECT COUNT(*) FROM tasks t2
   WHERE t.start_time >= t2.start_time AND
   t.start_time < t2.end_time) AS running_tasks
FROM tasks t
ORDER BY task_id;

Upvotes: 2

David דודו Markovitz
David דודו Markovitz

Reputation: 44951

select      task_id,start_time,end_time,running_tasks 

from       (select      task_id,tm,op,start_time,end_time

                       ,sum(op) over 
                        (
                            order by    tm,op 
                            rows        unbounded preceding
                        ) as running_tasks 

            from       (select      task_id,start_time as tm,1 as op,start_time,end_time 
                        from        tasks 

                        union   all 

                        select      task_id,end_time as tm,-1 as op,start_time,end_time 
                        from        tasks 
                        ) t 
            )t 

where       op = 1
;

Upvotes: 2

Related Questions