Reputation: 129
I want to have a rank that resets to 1 based on the appearance of a value. More specifically shown in the table below, I have a standard ranking partitioned by id, but I want to also have the "session_rank" column that resets to 1 whenever the row takes the value "new session" in the session field:
| timestamp | id | previoustime | previousid | rnk | session | session_rank |
|-----------------------------|--------|-----------------------------|------------|-----|-------------|--------------|
| 2019-10-16 19:00:00.000 UTC | abc123 | Null | Null | 1 | | 1 |
| 2019-10-16 19:01:00.000 UTC | abc123 | 2019-10-16 19:00:00.000 UTC | abc123 | 2 | | 2 |
| 2019-10-16 19:02:00.000 UTC | abc123 | 2019-10-16 19:01:00.000 UTC | abc123 | 3 | | 3 |
| 2019-10-16 21:00:00.000 UTC | abc123 | 2019-10-16 19:02:00.000 UTC | abc123 | 4 | new session | 1 |
| 2019-10-16 21:01:00.000 UTC | abc123 | 2019-10-16 21:00:00.000 UTC | abc123 | 5 | | 2 |
| 2019-10-16 19:00:00.000 UTC | def456 | 2019-10-16 21:01:00.000 UTC | abc123 | 1 | | 1 |
| 2019-10-16 19:01:00.000 UTC | def456 | 2019-10-16 19:00:00.000 UTC | def456 | 2 | | 2 |
| 2019-10-16 19:02:00.000 UTC | def456 | 2019-10-16 19:01:00.000 UTC | def456 | 3 | | 3 |
| 2019-10-16 19:03:00.000 UTC | def456 | 2019-10-16 19:02:00.000 UTC | def456 | 4 | | 4 |
I'm aware I should be using a window function and have toyed around with a ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
statement in my over clause but haven't quite got it to work.
Here's what I have:
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY id, sessiongroup ORDER BY timestamp) newrank
FROM (
SELECT
*,
FIRST_VALUE(newrnk ignore nulls) OVER (PARTITION BY id ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) sessiongroup
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY id, session ORDER BY timestamp ) AS newrnk
FROM (
SELECT
*,
CASE
WHEN id=lastid AND TIMESTAMP_DIFF(timestamp, lasttime, minute)>90 THEN "new session"
ELSE
""
END
session
FROM (
SELECT
timestamp,
id,
LAG(timestamp) OVER(ORDER BY id, timestamp) lasttime,
LAG(id) OVER(ORDER BY id, timestamp) lastid,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY timestamp ) AS rnk
FROM
`table`
WHERE
timestamp >= "2019-10-16 00:00:00"
AND timestamp < "2019-10-18 00:00:00"
AND id IS NOT NULL
))))
ORDER BY
id,
timestamp
There's also the question of whether this is even the optimal method to use in achieving the desired result, as I'm using BigQuery and constant bump up against the memory limit using OVER
and ORDER BY
. Is there a more memory efficient solution?
Upvotes: 0
Views: 59
Reputation: 172993
Below is for BigQuery Standard SQL
#standardSQL
SELECT ts, id, ROW_NUMBER() OVER(PARTITION BY id, session ORDER BY ts) rnk
FROM (
SELECT *, COUNTIF(new_session) OVER(PARTITION BY id ORDER BY ts) session
FROM (
SELECT *, IFNULL(TIMESTAMP_DIFF(ts, LAG(ts) OVER(PARTITION BY id ORDER BY ts), MINUTE) > 90, FALSE) new_session
FROM `project.dataset.table`
)
)
You can test, play with above using sample data from your question as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT TIMESTAMP '2019-10-16 19:00:00.000 UTC' ts, 'abc123' id UNION ALL
SELECT '2019-10-16 19:01:00.000 UTC', 'abc123' UNION ALL
SELECT '2019-10-16 19:02:00.000 UTC', 'abc123' UNION ALL
SELECT '2019-10-16 21:00:00.000 UTC', 'abc123' UNION ALL
SELECT '2019-10-16 21:01:00.000 UTC', 'abc123' UNION ALL
SELECT '2019-10-16 19:00:00.000 UTC', 'def456' UNION ALL
SELECT '2019-10-16 19:01:00.000 UTC', 'def456' UNION ALL
SELECT '2019-10-16 19:02:00.000 UTC', 'def456' UNION ALL
SELECT '2019-10-16 19:03:00.000 UTC', 'def456'
)
SELECT ts, id, ROW_NUMBER() OVER(PARTITION BY id, session ORDER BY ts) rnk
FROM (
SELECT *, COUNTIF(new_session) OVER(PARTITION BY id ORDER BY ts) session
FROM (
SELECT *, IFNULL(TIMESTAMP_DIFF(ts, LAG(ts) OVER(PARTITION BY id ORDER BY ts), MINUTE) > 90, FALSE) new_session
FROM `project.dataset.table`
)
)
-- ORDER BY id, ts
with result
Row ts id rnk
1 2019-10-16 19:00:00 UTC abc123 1
2 2019-10-16 19:01:00 UTC abc123 2
3 2019-10-16 19:02:00 UTC abc123 3
4 2019-10-16 21:00:00 UTC abc123 1
5 2019-10-16 21:01:00 UTC abc123 2
6 2019-10-16 19:00:00 UTC def456 1
7 2019-10-16 19:01:00 UTC def456 2
8 2019-10-16 19:02:00 UTC def456 3
9 2019-10-16 19:03:00 UTC def456 4
Upvotes: 2
Reputation: 1269873
Use a cumulative countif()
to assign the sessions:
select t.*,
row_number() over (partition by id, grp order by timestamp) as with_session_counter
from (select t.*,
countif(session = 'new session') over (partition by id order by timestamp) as grp
from t
) t
Upvotes: 0