Reputation: 655
I want to find maximum number of overlapping intervals I have in given period of time. Data have always start and end timestamp. And for given period of time (i.e. hourly) I want to get how many in total unique rows I had that was in given time, and a bit more troublesome maximum of concurrent ones in it.
Sample data:
id | start | end |
---|---|---|
1 | 2011-12-19 06:00:00 | 2011-12-19 08:45:00 |
2 | 2011-12-19 06:15:00 | 2011-12-19 06:30:00 |
3 | 2011-12-19 06:30:00 | 2011-12-19 06:45:00 |
4 | 2011-12-19 06:40:00 | 2011-12-19 07:15:00 |
5 | 2011-12-19 07:15:00 | 2011-12-19 08:45:00 |
6 | 2011-12-19 07:30:00 | 2011-12-19 07:50:00 |
7 | 2011-12-19 08:00:00 | 2011-12-19 08:30:00 |
8 | 2011-12-19 08:00:00 | 2011-12-19 08:15:00 |
9 | 2011-12-19 08:30:00 | 2011-12-19 08:45:00 |
For this data hourly result would look like:
id | period | max | total |
---|---|---|---|
1 | 2011-12-18 06:00:00 - 2011-12-19 07:00:00 | 3 | 4 |
2 | 2011-12-19 07:00:00 - 2011-12-19 08:00:00 | 3 | 4 |
3 | 2011-12-19 08:00:00 - 2011-12-19 09:00:00 | 4 | 5 |
Where max (max concurrent) would be:
2011-12-18 06:00:00 - Concurrent sessions: (2,1), (3,1,4) Total: 1,2,3,4
2011-12-18 07:00:00 - Concurrent sessions: (1,4), (5,1,6) Total: 1,4,5,6
2011-12-18 08:00:00 - Concurrent sessions: (1,5,7,8), (9,1,5) Total: 1,5,7,8,9
Any ideas how I could achieve something like this using SQL (BigQuery)?
Upvotes: 0
Views: 1195
Reputation: 172964
Consider below approach - seems to me most simple and least verbose
select
timestamp_trunc(ts, hour) hour,
max(concurrent) `max`,
hll_count.merge(ids) total
from (
select ts, count(distinct id) concurrent, hll_count.init(id) ids
from `project.dataset.table`,
unnest(generate_timestamp_array(start, `end`, interval 1 minute)) ts
group by ts
)
group by hour
if applied to sample data in your question - output is
Upvotes: 1
Reputation: 1269503
This is a little complicated, but here is a query:
with t as (
select 1 as id, timestamp('2011-12-19 06:00:00') as startt, timestamp('2011-12-19 08:45:00') as endt union all
select 2 as id, timestamp('2011-12-19 06:15:00') as startt, timestamp('2011-12-19 06:30:00') as endt union all
select 3 as id, timestamp('2011-12-19 06:30:00') as startt, timestamp('2011-12-19 06:45:00') as endt union all
select 4 as id, timestamp('2011-12-19 06:40:00') as startt, timestamp('2011-12-19 07:15:00') as endt union all
select 5 as id, timestamp('2011-12-19 07:15:00') as startt, timestamp('2011-12-19 08:45:00') as endt union all
select 6 as id, timestamp('2011-12-19 07:30:00') as startt, timestamp('2011-12-19 07:50:00') as endt union all
select 7 as id, timestamp('2011-12-19 08:00:00') as startt, timestamp('2011-12-19 08:30:00') as endt union all
select 8 as id, timestamp('2011-12-19 08:00:00') as startt, timestamp('2011-12-19 08:15:00') as endt union all
select 9 as id, timestamp('2011-12-19 08:30:00') as startt, timestamp('2011-12-19 08:45:00') as endt
),
se as (
select id, startt as ts, 1 as inc
from t union all
select id, endt as ts, -1 as inc
from t union all
select null, ts, 0
from unnest(generate_timestamp_array(timestamp('2011-12-19 06:00:00'),
timestamp('2011-12-19 08:00:00'),
interval 1 hour)
) ts
),
p as (
select ts, (inc = 0) as col, sum(inc) as value_at,
countif(inc = 1) as num_starts,
sum(sum(inc)) over (order by ts, max(inc = 0) desc) as active_at,
sum(countif(inc = 0)) over (order by ts, max(inc = 0) desc) as period_grp
from se
group by 1, 2
)
select period_grp, min(ts) as period,
max(active_at) as max_in_period,
(array_agg(active_at order by ts limit 1)[ordinal(1)] +
sum(num_starts)
) as total
from p
group by period_grp;
The key idea is to split the starts and stops into separate rows with an "increment" of +1 or -1. This is then augmented with the hourly breaks that you want.
The code then does the following:
Then the two calculations you want are:
Upvotes: 1
Reputation: 108641
Let's start with a resultset containing all the distinct timestamps in your ev
(event) table. (UNION strips duplicates.)
SELECT start t FROM ev
UNION
SELECT end t FROM ev
Next let's figure out how many sessions are active at each of these points in time. We can do that by using a JOIN to check whether each session is active at the point in time. fiddle.
SELECT COUNT(*) concurrent, t.t
FROM ev
JOIN (
SELECT start t FROM ev
UNION
SELECT end t FROM ev
) t ON ev.start <= t.t AND ev.end > t.t
GROUP BY t.t
If you have many many sessions, this query can do a lot of heavy lifting. You'd be smart, in production, to restrict it by date range, and to put a compound index on (start, end)
.
Finally, group that result set by hour and take the maximum concurrency. fiddle
SELECT DATE_FORMAT(t, '%Y-%m-%d %H:00') hour_beginning,
MAX(concurrent) concurrent
FROM (
SELECT COUNT(*) concurrent, t.t
FROM ev
JOIN (
SELECT start t FROM ev
UNION
SELECT end t FROM ev
) t ON ev.start <= t.t AND ev.end > t.t
GROUP BY t.t
) q
GROUP BY DATE_FORMAT(t, '%Y-%m-%d %H:00')
Notice a couple of things.
DATE_FORMAT(t, '%Y-%m-%d %H:00')
gets you a timestamp that's the beginning of the hour of t
.end
columns in your table record the first moment the session became inactive, not the last moment the session was active. (There are two kinds of hard problems in computer science: naming things, caching things, and off-by-one errors. :-)This is tested on MySQL. BigQuery may vary in its syntax.
Upvotes: 1