Reputation: 4870
Given a table of intervals, can I efficiently query for the number of currently open intervals at the start of each interval (including the current interval itself)?
For example, given the following table:
start_time end_time 1 10 2 5 3 4 5 6 7 11 19 20
I want the following output:
start_time count 1 1 2 2 3 3 5 3 7 2 19 1
On small datasets, I can solve this problem by joining the dataset against itself:
WITH intervals AS (
SELECT 1 AS start, 10 AS end UNION ALL
SELECT 2, 5 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 5, 6 UNION ALL
SELECT 7, 11 UNION ALL
SELECT 19, 20
)
SELECT
a.start_time,
count(*)
FROM
intervals a CROSS JOIN intervals b
WHERE
a.start_time >= b.start_time AND
a.start_time <= b.end_time
GROUP BY a.start_time
ORDER BY a.start_time
With large datasets the CROSS JOIN is both impractical and unnecessary, since any given answer only depends on a small number of preceding intervals (when sorted by start_time
). In fact, on the dataset that I have, it times out. Is there a better way to achieve this?
Upvotes: 0
Views: 1520
Reputation: 172954
... CROSS JOIN is both impractical and unnecessary ...
Is there a better way to achieve this?
Try below for BigQuery Standard SQL. No JOINs involved
#standardSQL
SELECT
start_time,
(SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt
FROM (
SELECT
start_time,
ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
FROM intervals
)
-- ORDER BY start_time
You can test/play with it using below example with dummy data from your question
#standardSQL
WITH intervals AS (
SELECT 1 AS start_time, 10 AS end_time UNION ALL
SELECT 2, 5 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 5, 6 UNION ALL
SELECT 7, 11 UNION ALL
SELECT 19, 20
)
SELECT
start_time,
(SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt
FROM (
SELECT
start_time,
ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
FROM intervals
)
-- ORDER BY start_time
Upvotes: 3