Create additional ranking that resets based on specific value

Question

I want to have a rank that resets to 1 based on the appearance of a value. More specifically shown in the table below, I have a standard ranking partitioned by id, but I want to also have the "session_rank" column that resets to 1 whenever the row takes the value "new session" in the session field:

| timestamp                   | id     | previoustime                | previousid | rnk | session     | session_rank |
|-----------------------------|--------|-----------------------------|------------|-----|-------------|--------------|
| 2019-10-16 19:00:00.000 UTC | abc123 | Null                        | Null       | 1   |             | 1            |
| 2019-10-16 19:01:00.000 UTC | abc123 | 2019-10-16 19:00:00.000 UTC | abc123     | 2   |             | 2            |
| 2019-10-16 19:02:00.000 UTC | abc123 | 2019-10-16 19:01:00.000 UTC | abc123     | 3   |             | 3            |
| 2019-10-16 21:00:00.000 UTC | abc123 | 2019-10-16 19:02:00.000 UTC | abc123     | 4   | new session | 1            |
| 2019-10-16 21:01:00.000 UTC | abc123 | 2019-10-16 21:00:00.000 UTC | abc123     | 5   |             | 2            |
| 2019-10-16 19:00:00.000 UTC | def456 | 2019-10-16 21:01:00.000 UTC | abc123     | 1   |             | 1            |
| 2019-10-16 19:01:00.000 UTC | def456 | 2019-10-16 19:00:00.000 UTC | def456     | 2   |             | 2            |
| 2019-10-16 19:02:00.000 UTC | def456 | 2019-10-16 19:01:00.000 UTC | def456     | 3   |             | 3            |
| 2019-10-16 19:03:00.000 UTC | def456 | 2019-10-16 19:02:00.000 UTC | def456     | 4   |             | 4            |

I'm aware I should be using a window function and have toyed around with a ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING statement in my over clause but haven't quite got it to work. Here's what I have:

SELECT
  *,
  ROW_NUMBER() OVER(PARTITION BY id, sessiongroup ORDER BY timestamp) newrank
FROM (
  SELECT
    *,
    FIRST_VALUE(newrnk ignore nulls) OVER (PARTITION BY id ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) sessiongroup
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER(PARTITION BY id, session ORDER BY timestamp ) AS newrnk
    FROM (
      SELECT
        *,
        CASE
          WHEN id=lastid AND TIMESTAMP_DIFF(timestamp, lasttime, minute)>90 THEN "new session"
        ELSE
        ""
      END
        session
      FROM (
        SELECT
          timestamp,
          id,
          LAG(timestamp) OVER(ORDER BY id, timestamp) lasttime,
          LAG(id) OVER(ORDER BY id, timestamp) lastid,
          ROW_NUMBER() OVER(PARTITION BY id ORDER BY timestamp ) AS rnk
        FROM
          `table`
        WHERE
          timestamp >= "2019-10-16 00:00:00"
          AND timestamp < "2019-10-18 00:00:00"
          AND id IS NOT NULL
    ))))
ORDER BY
  id,
  timestamp

There's also the question of whether this is even the optimal method to use in achieving the desired result, as I'm using BigQuery and constant bump up against the memory limit using OVER and ORDER BY. Is there a more memory efficient solution?

Mikhail Berlyant · Accepted Answer

Below is for BigQuery Standard SQL

#standardSQL
SELECT ts, id, ROW_NUMBER() OVER(PARTITION BY id, session ORDER BY ts) rnk
FROM (
  SELECT *, COUNTIF(new_session) OVER(PARTITION BY id ORDER BY ts) session
  FROM (
    SELECT *, IFNULL(TIMESTAMP_DIFF(ts, LAG(ts) OVER(PARTITION BY id ORDER BY ts), MINUTE) > 90, FALSE) new_session  
    FROM `project.dataset.table` 
  )
)

You can test, play with above using sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIMESTAMP '2019-10-16 19:00:00.000 UTC' ts, 'abc123' id UNION ALL
  SELECT '2019-10-16 19:01:00.000 UTC', 'abc123' UNION ALL  
  SELECT '2019-10-16 19:02:00.000 UTC', 'abc123' UNION ALL
  SELECT '2019-10-16 21:00:00.000 UTC', 'abc123' UNION ALL
  SELECT '2019-10-16 21:01:00.000 UTC', 'abc123' UNION ALL
  SELECT '2019-10-16 19:00:00.000 UTC', 'def456' UNION ALL
  SELECT '2019-10-16 19:01:00.000 UTC', 'def456' UNION ALL
  SELECT '2019-10-16 19:02:00.000 UTC', 'def456' UNION ALL
  SELECT '2019-10-16 19:03:00.000 UTC', 'def456'  
)
SELECT ts, id, ROW_NUMBER() OVER(PARTITION BY id, session ORDER BY ts) rnk
FROM (
  SELECT *, COUNTIF(new_session) OVER(PARTITION BY id ORDER BY ts) session
  FROM (
    SELECT *, IFNULL(TIMESTAMP_DIFF(ts, LAG(ts) OVER(PARTITION BY id ORDER BY ts), MINUTE) > 90, FALSE) new_session  
    FROM `project.dataset.table` 
  )
)
-- ORDER BY id, ts

with result

Row ts                      id      rnk  
1   2019-10-16 19:00:00 UTC abc123  1    
2   2019-10-16 19:01:00 UTC abc123  2    
3   2019-10-16 19:02:00 UTC abc123  3    
4   2019-10-16 21:00:00 UTC abc123  1    
5   2019-10-16 21:01:00 UTC abc123  2    
6   2019-10-16 19:00:00 UTC def456  1    
7   2019-10-16 19:01:00 UTC def456  2    
8   2019-10-16 19:02:00 UTC def456  3    
9   2019-10-16 19:03:00 UTC def456  4

Create additional ranking that resets based on specific value

Answers (2)

Related Questions