jenswirf
jenswirf

Reputation: 7317

Calculating the longest binge-viewing streak using SQL

I'm attempting to crunch some binge-viewing stats and I'd like to find out how long the longest binge streak is (a binge being multiple programs viewed in concession, one after another, no more than 2 hours apart). The data looks like this:

datetime                user_id program
2013-09-01 00:01:18     1       A
2013-09-10 14:03:14     1       B
2013-09-20 17:02:12     2       A  
2013-09-21 00:03:22     2       C  <-- user 2 binge start
2013-09-21 01:23:22     2       M
2013-09-21 03:03:22     2       E
2013-09-21 04:03:22     2       F  
2013-09-21 06:03:22     2       G  <-- user 2 binge end
2013-09-21 09:03:22     2       H
2013-09-03 18:21:09     3       D
2013-09-21 09:03:22     2       H
2013-09-24 19:21:00     2       X  <-- user 2 second binge start
2013-09-24 20:21:00     2       Y
2013-09-24 21:21:00     2       Z  <-- user 2 second binge end

SQL Fiddle

In this example user 2 had a binge that lasted 6 hours and later another that lasted 2 hours.

The end result I would like is something like:

user_id     binge     length
2           1         6 hours
2           2         2 hours

Can this be calculated directly in the database?

Upvotes: 3

Views: 621

Answers (2)

Noel
Noel

Reputation: 10525

This is a problem of identifying sequences/streak in the data. My preferred way of doing this is,

  • Using LAG function to identify the beginning of each streak
  • Using SUM function to assign a unique number to each streak
  • Then group by this unique number for further processing

Query:

with start_grp as (
  select dt, user_id, programme,
         case when dt - lag(dt,1) over (partition by user_id order by dt) 
                   > interval '0 day 2:00:00'
              then 1
              else 0
         end grp_start
  from binge
  ),
assign_grp as (
  select dt, user_id, programme,
  sum(grp_start) over (partition by user_id order by dt) grp
  from start_grp)
select user_id, grp as binge, max(dt) - min(dt) as binge_length
from assign_grp
group by user_id, grp
having count(programme) > 1

Here binge column may not come in sequential manner. You can use ROW_NUMBER function over the final query to correct it.

Demo at sqlfiddle

Upvotes: 3

Bruno
Bruno

Reputation: 122719

Here is a solution using a recursive CTE (it's not really "recursive", but that's how they're called) and window functions. You'll need PostgreSQL 8.4 at least for this.

SQL Fiddle

PostgreSQL 9.1.9 Schema Setup:

CREATE TABLE viewings (
    user_id INTEGER NOT NULL,
    datetime TIMESTAMPTZ NOT NULL,
    programme TEXT NOT NULL,
    PRIMARY KEY (user_id, datetime)
);

INSERT INTO viewings (datetime, user_id, programme) VALUES
('2013-09-01 00:01:18', 1, 'A'),
('2013-09-10 14:03:14', 1, 'B'),
('2013-09-20 17:02:12', 2, 'A'),
('2013-09-21 00:03:22', 2, 'C'),
('2013-09-21 01:23:22', 2, 'M'),
('2013-09-21 03:03:22', 2, 'E'),
('2013-09-21 04:03:22', 2, 'F'),
('2013-09-21 06:03:22', 2, 'G'),
('2013-09-21 09:03:22', 2, 'H'),
('2013-09-03 18:21:09', 3, 'D'),
('2013-09-22 09:03:22', 2, 'H'),
('2013-09-24 19:21:00', 2, 'X'),
('2013-09-24 20:21:00', 2, 'Y'),
('2013-09-24 21:21:00', 2, 'Z');

Query 1:

WITH RECURSIVE consecutive_viewings(user_id, first_dt, last_dt) AS (
  WITH lagged_viewings AS (
    SELECT user_id, LAG(user_id) OVER w AS prev_user_id,
           datetime, LAG(datetime) OVER w AS prev_datetime,
           programme
    FROM viewings WINDOW w AS (PARTITION BY user_id ORDER BY datetime)
  )
  SELECT user_id, datetime AS first_dt, datetime AS last_dt
    FROM lagged_viewings
    WHERE prev_datetime IS NULL OR (prev_datetime + '2 hours'::interval) < datetime
  UNION ALL
  SELECT lv.user_id, cv.first_dt, lv.datetime AS last_dt
    FROM consecutive_viewings cv
      INNER JOIN lagged_viewings lv
      ON lv.user_id=cv.user_id AND
         lv.prev_datetime=cv.last_dt
      WHERE (lv.prev_datetime + '2 hours'::interval) >= lv.datetime
)
SELECT user_id, first_dt, MAX(last_dt) AS last_dt
   FROM consecutive_viewings
   WHERE first_dt != last_dt
   GROUP BY user_id, first_dt
   ORDER BY user_id, first_dt

Results:

| USER_ID |                         FIRST_DT |                          LAST_DT |
|---------|----------------------------------|----------------------------------|
|       2 | September, 21 2013 00:03:22+0000 | September, 21 2013 06:03:22+0000 |
|       2 | September, 24 2013 19:21:00+0000 | September, 24 2013 21:21:00+0000 |

To understand this, it's probably easier to start with the most nested CTE. This will order the viewings by user_id and datetime, but this will also add an extra column with the timestamp of the previous viewing so that you can chain them later. This is not a recursive CTE (and a CTE isn't even needed for the following query on its own):

Query 2:

WITH lagged_viewings AS (
    SELECT user_id, LAG(user_id) OVER w AS prev_user_id,
           datetime, LAG(datetime) OVER w AS prev_datetime,
           programme
    FROM viewings WINDOW w AS (PARTITION BY user_id ORDER BY datetime)
)
SELECT * FROM lagged_viewings

Results:

| USER_ID | PREV_USER_ID |                         DATETIME |                    PREV_DATETIME | PROGRAMME |
|---------|--------------|----------------------------------|----------------------------------|-----------|
|       1 |       (null) | September, 01 2013 00:01:18+0000 |                           (null) |         A |
|       1 |            1 | September, 10 2013 14:03:14+0000 | September, 01 2013 00:01:18+0000 |         B |
|       2 |       (null) | September, 20 2013 17:02:12+0000 |                           (null) |         A |
|       2 |            2 | September, 21 2013 00:03:22+0000 | September, 20 2013 17:02:12+0000 |         C |
|       2 |            2 | September, 21 2013 01:23:22+0000 | September, 21 2013 00:03:22+0000 |         M |
|       2 |            2 | September, 21 2013 03:03:22+0000 | September, 21 2013 01:23:22+0000 |         E |
|       2 |            2 | September, 21 2013 04:03:22+0000 | September, 21 2013 03:03:22+0000 |         F |
|       2 |            2 | September, 21 2013 06:03:22+0000 | September, 21 2013 04:03:22+0000 |         G |
|       2 |            2 | September, 21 2013 09:03:22+0000 | September, 21 2013 06:03:22+0000 |         H |
|       2 |            2 | September, 22 2013 09:03:22+0000 | September, 21 2013 09:03:22+0000 |         H |
|       2 |            2 | September, 24 2013 19:21:00+0000 | September, 22 2013 09:03:22+0000 |         X |
|       2 |            2 | September, 24 2013 20:21:00+0000 | September, 24 2013 19:21:00+0000 |         Y |
|       2 |            2 | September, 24 2013 21:21:00+0000 | September, 24 2013 20:21:00+0000 |         Z |
|       3 |       (null) | September, 03 2013 18:21:09+0000 |                           (null) |         D |

This recursive CTE id probably a bit trickier to understand. The "recursion" relies on a union between two select statements.

  • The first one seeds the iterations (it's the non-recursive part): it will find the rows that are the start of a chain of viewings (i.e. where the previous datetime is null if it's the first for that user, or the previous datetime is further than your cut-off interval).
  • The second one chains that viewing to make a longer period. Some durations will overlap, because it doesn't know when it ends. This is where the conditions (in the overall query at the top) are used, to find the max and eliminate the periods with a single viewing.

Query 3:

WITH RECURSIVE consecutive_viewings(user_id, first_dt, last_dt) AS (
  WITH lagged_viewings AS (
    SELECT user_id, LAG(user_id) OVER w AS prev_user_id,
           datetime, LAG(datetime) OVER w AS prev_datetime,
           programme
    FROM viewings WINDOW w AS (PARTITION BY user_id ORDER BY datetime)
  )
  -- These are the starts of the "binge" durations
  SELECT user_id, datetime AS first_dt, datetime AS last_dt
    FROM lagged_viewings
    WHERE prev_datetime IS NULL OR (prev_datetime + '2 hours'::interval) < datetime
  UNION ALL
  -- These are the extended periods
  SELECT lv.user_id, cv.first_dt, lv.datetime AS last_dt
    FROM consecutive_viewings cv
      INNER JOIN lagged_viewings lv
      ON lv.user_id=cv.user_id AND
         lv.prev_datetime=cv.last_dt
      WHERE (lv.prev_datetime + '2 hours'::interval) >= lv.datetime
)
SELECT * FROM consecutive_viewings
   ORDER BY user_id, first_dt, last_dt

Results:

| USER_ID |                         FIRST_DT |                          LAST_DT |
|---------|----------------------------------|----------------------------------|
|       1 | September, 01 2013 00:01:18+0000 | September, 01 2013 00:01:18+0000 |
|       1 | September, 10 2013 14:03:14+0000 | September, 10 2013 14:03:14+0000 |
|       2 | September, 20 2013 17:02:12+0000 | September, 20 2013 17:02:12+0000 |
|       2 | September, 21 2013 00:03:22+0000 | September, 21 2013 00:03:22+0000 |
|       2 | September, 21 2013 00:03:22+0000 | September, 21 2013 01:23:22+0000 |
|       2 | September, 21 2013 00:03:22+0000 | September, 21 2013 03:03:22+0000 |
|       2 | September, 21 2013 00:03:22+0000 | September, 21 2013 04:03:22+0000 |
|       2 | September, 21 2013 00:03:22+0000 | September, 21 2013 06:03:22+0000 |
|       2 | September, 21 2013 09:03:22+0000 | September, 21 2013 09:03:22+0000 |
|       2 | September, 22 2013 09:03:22+0000 | September, 22 2013 09:03:22+0000 |
|       2 | September, 24 2013 19:21:00+0000 | September, 24 2013 19:21:00+0000 |
|       2 | September, 24 2013 19:21:00+0000 | September, 24 2013 20:21:00+0000 |
|       2 | September, 24 2013 19:21:00+0000 | September, 24 2013 21:21:00+0000 |
|       3 | September, 03 2013 18:21:09+0000 | September, 03 2013 18:21:09+0000 |

Upvotes: 1

Related Questions