Brandon DuRette
Brandon DuRette

Reputation: 4870

Count of overlapping intervals in BigQuery

Given a table of intervals, can I efficiently query for the number of currently open intervals at the start of each interval (including the current interval itself)?

For example, given the following table:

start_time end_time
         1       10
         2        5
         3        4
         5        6
         7       11
        19       20

I want the following output:

start_time count
         1     1
         2     2
         3     3
         5     3
         7     2
        19     1

On small datasets, I can solve this problem by joining the dataset against itself:

WITH intervals AS (
  SELECT 1 AS start, 10 AS end UNION ALL
  SELECT 2, 5 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 5, 6 UNION ALL
  SELECT 7, 11 UNION ALL
  SELECT 19, 20
)
SELECT 
  a.start_time,
  count(*)
FROM 
  intervals a CROSS JOIN intervals b
WHERE
  a.start_time >= b.start_time AND
  a.start_time <= b.end_time
GROUP BY a.start_time
ORDER BY a.start_time

With large datasets the CROSS JOIN is both impractical and unnecessary, since any given answer only depends on a small number of preceding intervals (when sorted by start_time). In fact, on the dataset that I have, it times out. Is there a better way to achieve this?

Upvotes: 0

Views: 1520

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172954

... CROSS JOIN is both impractical and unnecessary ...
Is there a better way to achieve this?

Try below for BigQuery Standard SQL. No JOINs involved

#standardSQL
SELECT 
  start_time,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt  
FROM (
  SELECT 
    start_time, 
    ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
  FROM intervals
)
-- ORDER BY start_time  

You can test/play with it using below example with dummy data from your question

#standardSQL
WITH intervals AS (
  SELECT 1 AS start_time, 10 AS end_time UNION ALL
  SELECT 2, 5 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 5, 6 UNION ALL
  SELECT 7, 11 UNION ALL
  SELECT 19, 20 
)
SELECT 
  start_time,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt  
FROM (
  SELECT 
    start_time, 
    ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
  FROM intervals
)
-- ORDER BY start_time

Upvotes: 3

Related Questions