koper89
koper89

Reputation: 655

Find max overlapping intervals in periods of time

I want to find maximum number of overlapping intervals I have in given period of time. Data have always start and end timestamp. And for given period of time (i.e. hourly) I want to get how many in total unique rows I had that was in given time, and a bit more troublesome maximum of concurrent ones in it.

Sample data:

id start end
1 2011-12-19 06:00:00 2011-12-19 08:45:00
2 2011-12-19 06:15:00 2011-12-19 06:30:00
3 2011-12-19 06:30:00 2011-12-19 06:45:00
4 2011-12-19 06:40:00 2011-12-19 07:15:00
5 2011-12-19 07:15:00 2011-12-19 08:45:00
6 2011-12-19 07:30:00 2011-12-19 07:50:00
7 2011-12-19 08:00:00 2011-12-19 08:30:00
8 2011-12-19 08:00:00 2011-12-19 08:15:00
9 2011-12-19 08:30:00 2011-12-19 08:45:00

For this data hourly result would look like:

id period max total
1 2011-12-18 06:00:00 - 2011-12-19 07:00:00 3 4
2 2011-12-19 07:00:00 - 2011-12-19 08:00:00 3 4
3 2011-12-19 08:00:00 - 2011-12-19 09:00:00 4 5

Where max (max concurrent) would be:

2011-12-18 06:00:00 - Concurrent sessions: (2,1), (3,1,4) Total: 1,2,3,4

2011-12-18 07:00:00 - Concurrent sessions: (1,4), (5,1,6) Total: 1,4,5,6

2011-12-18 08:00:00 - Concurrent sessions: (1,5,7,8), (9,1,5) Total: 1,5,7,8,9

Any ideas how I could achieve something like this using SQL (BigQuery)?

Upvotes: 0

Views: 1195

Answers (3)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172964

Consider below approach - seems to me most simple and least verbose

select 
  timestamp_trunc(ts, hour) hour, 
  max(concurrent) `max`,
  hll_count.merge(ids) total
from (
  select ts, count(distinct id) concurrent, hll_count.init(id) ids
  from `project.dataset.table`, 
  unnest(generate_timestamp_array(start, `end`, interval 1 minute)) ts
  group by ts 
) 
group by hour         

if applied to sample data in your question - output is

enter image description here

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269503

This is a little complicated, but here is a query:

with t as (
      select 1 as id, timestamp('2011-12-19 06:00:00') as startt, timestamp('2011-12-19 08:45:00') as endt union all
      select 2 as id, timestamp('2011-12-19 06:15:00') as startt, timestamp('2011-12-19 06:30:00') as endt union all
      select 3 as id, timestamp('2011-12-19 06:30:00') as startt, timestamp('2011-12-19 06:45:00') as endt union all
      select 4 as id, timestamp('2011-12-19 06:40:00') as startt, timestamp('2011-12-19 07:15:00') as endt union all
      select 5 as id, timestamp('2011-12-19 07:15:00') as startt, timestamp('2011-12-19 08:45:00') as endt union all
      select 6 as id, timestamp('2011-12-19 07:30:00') as startt, timestamp('2011-12-19 07:50:00') as endt union all
      select 7 as id, timestamp('2011-12-19 08:00:00') as startt, timestamp('2011-12-19 08:30:00') as endt union all
      select 8 as id, timestamp('2011-12-19 08:00:00') as startt, timestamp('2011-12-19 08:15:00') as endt union all
      select 9 as id, timestamp('2011-12-19 08:30:00') as startt, timestamp('2011-12-19 08:45:00') as endt 
     ),
     se as (
         select id, startt as ts, 1 as inc
         from t union all
         select id, endt as ts, -1 as inc
         from t union all
         select null, ts, 0
         from unnest(generate_timestamp_array(timestamp('2011-12-19 06:00:00'),
                                              timestamp('2011-12-19 08:00:00'),
                                              interval 1 hour)
                    ) ts
     ),
     p as (
         select ts, (inc = 0) as col, sum(inc) as value_at,
                countif(inc = 1) as num_starts,
                sum(sum(inc)) over (order by ts, max(inc = 0) desc) as active_at,
                sum(countif(inc = 0)) over (order by ts, max(inc = 0) desc) as period_grp
         from se
         group by 1, 2
     )
select period_grp, min(ts) as period,
       max(active_at) as max_in_period,
       (array_agg(active_at order by ts limit 1)[ordinal(1)] +
        sum(num_starts)
       ) as total
from p
group by period_grp;

The key idea is to split the starts and stops into separate rows with an "increment" of +1 or -1. This is then augmented with the hourly breaks that you want.

The code then does the following:

  • Calculate the cumulative sum of the increment to get the number of concurrent ids at each timestamp.
  • Calculates the "period" for each timestamp by taking a cumulative sum of the generated rows.

Then the two calculations you want are:

  1. The max is simply the max of the concurrent in a group by.
  2. The total is the concurrent at the beginning of the time period (not including any that start at the beginning of the time period) plus any starts during the time period.

Upvotes: 1

O. Jones
O. Jones

Reputation: 108641

Let's start with a resultset containing all the distinct timestamps in your ev (event) table. (UNION strips duplicates.)

 SELECT start t FROM ev
  UNION 
 SELECT end t FROM ev

Next let's figure out how many sessions are active at each of these points in time. We can do that by using a JOIN to check whether each session is active at the point in time. fiddle.

SELECT COUNT(*) concurrent, t.t
FROM ev 
JOIN (
    SELECT start t FROM ev
    UNION 
    SELECT end t FROM ev
) t ON ev.start <= t.t AND ev.end > t.t
GROUP BY t.t

If you have many many sessions, this query can do a lot of heavy lifting. You'd be smart, in production, to restrict it by date range, and to put a compound index on (start, end).

Finally, group that result set by hour and take the maximum concurrency. fiddle

SELECT DATE_FORMAT(t, '%Y-%m-%d %H:00') hour_beginning,
       MAX(concurrent) concurrent
  FROM (
         SELECT COUNT(*) concurrent, t.t
           FROM ev 
           JOIN (
                  SELECT start t FROM ev
                  UNION 
                  SELECT end t FROM ev
       ) t ON ev.start <= t.t AND ev.end > t.t
   GROUP BY t.t
   ) q
GROUP BY DATE_FORMAT(t, '%Y-%m-%d %H:00')

Notice a couple of things.

  1. The expression DATE_FORMAT(t, '%Y-%m-%d %H:00') gets you a timestamp that's the beginning of the hour of t.
  2. To work perfectly, this assumes the end columns in your table record the first moment the session became inactive, not the last moment the session was active. (There are two kinds of hard problems in computer science: naming things, caching things, and off-by-one errors. :-)

This is tested on MySQL. BigQuery may vary in its syntax.

Upvotes: 1

Related Questions