Reputation: 7317
I'm attempting to crunch some binge-viewing stats and I'd like to find out how long the longest binge streak is (a binge being multiple programs viewed in concession, one after another, no more than 2 hours apart). The data looks like this:
datetime user_id program
2013-09-01 00:01:18 1 A
2013-09-10 14:03:14 1 B
2013-09-20 17:02:12 2 A
2013-09-21 00:03:22 2 C <-- user 2 binge start
2013-09-21 01:23:22 2 M
2013-09-21 03:03:22 2 E
2013-09-21 04:03:22 2 F
2013-09-21 06:03:22 2 G <-- user 2 binge end
2013-09-21 09:03:22 2 H
2013-09-03 18:21:09 3 D
2013-09-21 09:03:22 2 H
2013-09-24 19:21:00 2 X <-- user 2 second binge start
2013-09-24 20:21:00 2 Y
2013-09-24 21:21:00 2 Z <-- user 2 second binge end
In this example user 2 had a binge that lasted 6 hours and later another that lasted 2 hours.
The end result I would like is something like:
user_id binge length
2 1 6 hours
2 2 2 hours
Can this be calculated directly in the database?
Upvotes: 3
Views: 621
Reputation: 10525
This is a problem of identifying sequences/streak in the data. My preferred way of doing this is,
Query:
with start_grp as (
select dt, user_id, programme,
case when dt - lag(dt,1) over (partition by user_id order by dt)
> interval '0 day 2:00:00'
then 1
else 0
end grp_start
from binge
),
assign_grp as (
select dt, user_id, programme,
sum(grp_start) over (partition by user_id order by dt) grp
from start_grp)
select user_id, grp as binge, max(dt) - min(dt) as binge_length
from assign_grp
group by user_id, grp
having count(programme) > 1
Here binge column may not come in sequential manner. You can use ROW_NUMBER function over the final query to correct it.
Demo at sqlfiddle
Upvotes: 3
Reputation: 122719
Here is a solution using a recursive CTE (it's not really "recursive", but that's how they're called) and window functions. You'll need PostgreSQL 8.4 at least for this.
PostgreSQL 9.1.9 Schema Setup:
CREATE TABLE viewings (
user_id INTEGER NOT NULL,
datetime TIMESTAMPTZ NOT NULL,
programme TEXT NOT NULL,
PRIMARY KEY (user_id, datetime)
);
INSERT INTO viewings (datetime, user_id, programme) VALUES
('2013-09-01 00:01:18', 1, 'A'),
('2013-09-10 14:03:14', 1, 'B'),
('2013-09-20 17:02:12', 2, 'A'),
('2013-09-21 00:03:22', 2, 'C'),
('2013-09-21 01:23:22', 2, 'M'),
('2013-09-21 03:03:22', 2, 'E'),
('2013-09-21 04:03:22', 2, 'F'),
('2013-09-21 06:03:22', 2, 'G'),
('2013-09-21 09:03:22', 2, 'H'),
('2013-09-03 18:21:09', 3, 'D'),
('2013-09-22 09:03:22', 2, 'H'),
('2013-09-24 19:21:00', 2, 'X'),
('2013-09-24 20:21:00', 2, 'Y'),
('2013-09-24 21:21:00', 2, 'Z');
Query 1:
WITH RECURSIVE consecutive_viewings(user_id, first_dt, last_dt) AS (
WITH lagged_viewings AS (
SELECT user_id, LAG(user_id) OVER w AS prev_user_id,
datetime, LAG(datetime) OVER w AS prev_datetime,
programme
FROM viewings WINDOW w AS (PARTITION BY user_id ORDER BY datetime)
)
SELECT user_id, datetime AS first_dt, datetime AS last_dt
FROM lagged_viewings
WHERE prev_datetime IS NULL OR (prev_datetime + '2 hours'::interval) < datetime
UNION ALL
SELECT lv.user_id, cv.first_dt, lv.datetime AS last_dt
FROM consecutive_viewings cv
INNER JOIN lagged_viewings lv
ON lv.user_id=cv.user_id AND
lv.prev_datetime=cv.last_dt
WHERE (lv.prev_datetime + '2 hours'::interval) >= lv.datetime
)
SELECT user_id, first_dt, MAX(last_dt) AS last_dt
FROM consecutive_viewings
WHERE first_dt != last_dt
GROUP BY user_id, first_dt
ORDER BY user_id, first_dt
| USER_ID | FIRST_DT | LAST_DT |
|---------|----------------------------------|----------------------------------|
| 2 | September, 21 2013 00:03:22+0000 | September, 21 2013 06:03:22+0000 |
| 2 | September, 24 2013 19:21:00+0000 | September, 24 2013 21:21:00+0000 |
To understand this, it's probably easier to start with the most nested CTE. This will order the viewings by user_id
and datetime
, but this will also add an extra column with the timestamp of the previous viewing so that you can chain them later. This is not a recursive CTE (and a CTE isn't even needed for the following query on its own):
Query 2:
WITH lagged_viewings AS (
SELECT user_id, LAG(user_id) OVER w AS prev_user_id,
datetime, LAG(datetime) OVER w AS prev_datetime,
programme
FROM viewings WINDOW w AS (PARTITION BY user_id ORDER BY datetime)
)
SELECT * FROM lagged_viewings
| USER_ID | PREV_USER_ID | DATETIME | PREV_DATETIME | PROGRAMME |
|---------|--------------|----------------------------------|----------------------------------|-----------|
| 1 | (null) | September, 01 2013 00:01:18+0000 | (null) | A |
| 1 | 1 | September, 10 2013 14:03:14+0000 | September, 01 2013 00:01:18+0000 | B |
| 2 | (null) | September, 20 2013 17:02:12+0000 | (null) | A |
| 2 | 2 | September, 21 2013 00:03:22+0000 | September, 20 2013 17:02:12+0000 | C |
| 2 | 2 | September, 21 2013 01:23:22+0000 | September, 21 2013 00:03:22+0000 | M |
| 2 | 2 | September, 21 2013 03:03:22+0000 | September, 21 2013 01:23:22+0000 | E |
| 2 | 2 | September, 21 2013 04:03:22+0000 | September, 21 2013 03:03:22+0000 | F |
| 2 | 2 | September, 21 2013 06:03:22+0000 | September, 21 2013 04:03:22+0000 | G |
| 2 | 2 | September, 21 2013 09:03:22+0000 | September, 21 2013 06:03:22+0000 | H |
| 2 | 2 | September, 22 2013 09:03:22+0000 | September, 21 2013 09:03:22+0000 | H |
| 2 | 2 | September, 24 2013 19:21:00+0000 | September, 22 2013 09:03:22+0000 | X |
| 2 | 2 | September, 24 2013 20:21:00+0000 | September, 24 2013 19:21:00+0000 | Y |
| 2 | 2 | September, 24 2013 21:21:00+0000 | September, 24 2013 20:21:00+0000 | Z |
| 3 | (null) | September, 03 2013 18:21:09+0000 | (null) | D |
This recursive CTE id probably a bit trickier to understand. The "recursion" relies on a union between two select statements.
Query 3:
WITH RECURSIVE consecutive_viewings(user_id, first_dt, last_dt) AS (
WITH lagged_viewings AS (
SELECT user_id, LAG(user_id) OVER w AS prev_user_id,
datetime, LAG(datetime) OVER w AS prev_datetime,
programme
FROM viewings WINDOW w AS (PARTITION BY user_id ORDER BY datetime)
)
-- These are the starts of the "binge" durations
SELECT user_id, datetime AS first_dt, datetime AS last_dt
FROM lagged_viewings
WHERE prev_datetime IS NULL OR (prev_datetime + '2 hours'::interval) < datetime
UNION ALL
-- These are the extended periods
SELECT lv.user_id, cv.first_dt, lv.datetime AS last_dt
FROM consecutive_viewings cv
INNER JOIN lagged_viewings lv
ON lv.user_id=cv.user_id AND
lv.prev_datetime=cv.last_dt
WHERE (lv.prev_datetime + '2 hours'::interval) >= lv.datetime
)
SELECT * FROM consecutive_viewings
ORDER BY user_id, first_dt, last_dt
| USER_ID | FIRST_DT | LAST_DT |
|---------|----------------------------------|----------------------------------|
| 1 | September, 01 2013 00:01:18+0000 | September, 01 2013 00:01:18+0000 |
| 1 | September, 10 2013 14:03:14+0000 | September, 10 2013 14:03:14+0000 |
| 2 | September, 20 2013 17:02:12+0000 | September, 20 2013 17:02:12+0000 |
| 2 | September, 21 2013 00:03:22+0000 | September, 21 2013 00:03:22+0000 |
| 2 | September, 21 2013 00:03:22+0000 | September, 21 2013 01:23:22+0000 |
| 2 | September, 21 2013 00:03:22+0000 | September, 21 2013 03:03:22+0000 |
| 2 | September, 21 2013 00:03:22+0000 | September, 21 2013 04:03:22+0000 |
| 2 | September, 21 2013 00:03:22+0000 | September, 21 2013 06:03:22+0000 |
| 2 | September, 21 2013 09:03:22+0000 | September, 21 2013 09:03:22+0000 |
| 2 | September, 22 2013 09:03:22+0000 | September, 22 2013 09:03:22+0000 |
| 2 | September, 24 2013 19:21:00+0000 | September, 24 2013 19:21:00+0000 |
| 2 | September, 24 2013 19:21:00+0000 | September, 24 2013 20:21:00+0000 |
| 2 | September, 24 2013 19:21:00+0000 | September, 24 2013 21:21:00+0000 |
| 3 | September, 03 2013 18:21:09+0000 | September, 03 2013 18:21:09+0000 |
Upvotes: 1