Reputation: 1939
I have a time series (userid, timestamp)
and I'd like to determine sessions.
A session is defined at the user level by a fixed window of 5min which starts for each new row that is not in a previous session (the window duration would ideally depend on the number of previous rows for a given user but for now I can live with a fixed window duration) e.g.
WITH sample AS (
SELECT 0 user_id, TIMESTAMP('2020-01-01T00:00:00Z') timestamp UNION ALL -- new session
SELECT 1, TIMESTAMP('2020-01-01T00:00:00Z') UNION ALL -- new session
SELECT 1, TIMESTAMP('2020-01-01T00:04:00Z') UNION ALL
SELECT 1, TIMESTAMP('2020-01-01T00:06:00Z') UNION ALL -- new session
SELECT 1, TIMESTAMP('2020-01-01T00:10:00Z') UNION ALL
SELECT 1, TIMESTAMP('2020-01-01T00:11:00Z') UNION ALL -- new session
SELECT 1, TIMESTAMP('2020-01-01T01:00:00Z') -- new session
)
SELECT *
FROM sample;
I'm stuck since, in my reasoning, determining whether a row starts a new session depends on the "new session" column value from previous rows.
I'm targeting BigQuery so ideally it should be doable in BQ syntax.
Thanks for any help/hint!
Upvotes: 0
Views: 638
Reputation: 1270021
I think you want lag()
and a cumulative sum:
select s.*,
sum(case when prev_ts > timestamp_add(timestamp, interval -5 minute)
then 0 else 1 -- "1" starts a new session
end) over (partition by user_id order by timestamp) as session_num
from (select s.*,
lag(timestamp) over (partition by user_id order by timestamp) as prev_ts
from sample s
) s
Upvotes: 2