Colton Pan
Colton Pan

Reputation: 25

How to group sessions that have timestamps close to each other?

My scenario requires me to look at sessions that are less than 60 seconds apart as the same session.

data is like below.

Min_Timestamp                Max_Timestamp                Device_ID  Session_ID  Prev_Max_Timestamp           Diff_Sec
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:13.502 UTC  AAAAA      I90HYTRFJI  null                         null
2019-12-03 23:09:21.517 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      98UHIGSNJR  2019-12-03 23:09:13.502 UTC  8
2019-12-03 00:00:28.933 UTC  2019-12-03 00:09:03.473 UTC  BBBBB      32QE8Y76TG  null                         null
2019-12-03 00:09:19.106 UTC  2019-12-03 00:23:26.554 UTC  BBBBB      R4GUY432AD  2019-12-03 00:09:03.473 UTC  16
2019-12-03 00:23:26.818 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      E32GUYE328  2019-12-03 00:23:26.554 UTC  0
2019-12-03 17:00:32.160 UTC  2019-12-03 17:03:48.758 UTC  BBBBB      GY1EW32876  2019-12-03 00:23:26.837 UTC  59825
2019-12-03 17:03:58.069 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      2876T128Y7  2019-12-03 17:03:48.758 UTC  9
2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      098U6598U5  2019-12-03 17:17:12.408 UTC  73
2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      UWI4UII2J4  null                         null
2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      G3247ROIUH  2019-12-03 18:44:18.972 UTC  82080

Group the sessions together that are less than 60 seconds apart while still separate by device. It will look like this.

Min_Timestamp                Max_Timestamp                Device_ID  Session_ID  Prev_Max_Timestamp           Diff_Sec
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:13.502 UTC  AAAAA      I90HYTRFJI  null                         null
2019-12-03 23:09:21.517 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      98UHIGSNJR  2019-12-03 23:09:13.502 UTC  8

2019-12-03 00:00:28.933 UTC  2019-12-03 00:09:03.473 UTC  BBBBB      32QE8Y76TG  null                         null
2019-12-03 00:09:19.106 UTC  2019-12-03 00:23:26.554 UTC  BBBBB      R4GUY432AD  2019-12-03 00:09:03.473 UTC  16
2019-12-03 00:23:26.818 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      E32GUYE328  2019-12-03 00:23:26.554 UTC  0

2019-12-03 17:00:32.160 UTC  2019-12-03 17:03:48.758 UTC  BBBBB      GY1EW32876  2019-12-03 00:23:26.837 UTC  59825
2019-12-03 17:03:58.069 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      2876T128Y7  2019-12-03 17:03:48.758 UTC  9
2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      098U6598U5  2019-12-03 17:17:12.408 UTC  73

2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      UWI4UII2J4  null                         null

2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      G3247ROIUH  2019-12-03 18:44:18.972 UTC  82080

I want to be able to get something that look like this. Session_ID does not need to be like A1, B1, C1, etc. It can simply be the first value of the session. Notice the Max_Timestamp of the latest one is now the new Max_Timestamp.

Min_Timestamp                Max_Timestamp                Device_ID  Session_ID
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      A1          
2019-12-03 00:00:28.933 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      B1
2019-12-03 17:00:32.160 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      B2
2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      C1
2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      C2

My idea was to make all Session_ID that belong to the same group the same. Then group by Device_ID and Session_ID to get min(Min_Timestamp) and max(Max_Timestamp). I tried to fiddle with first_value() on Session_ID, but I can't figure out how to partition it correctly.

Best to achieve this in legacy. if not, standard will work,too.

Upvotes: 1

Views: 47

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Below is for BigQuery Standard SQL (if you want - just "translate" it to Legacy - but the advice is to migrate to Standard anyway!!! so do it now and use below)

#standardSQL
SELECT MIN(Min_Timestamp) AS Min_Timestamp, MAX(Max_Timestamp) AS Max_Timestamp, Device_ID, Session_ID
FROM (
  SELECT * EXCEPT(flag, Session_ID), 
    CONCAT(Device_ID, CAST(COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS STRING)) AS Session_ID
  FROM (
    SELECT *, 
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 flag
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, Session_ID

You can test, play with above using sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIMESTAMP '2019-12-03 23:05:30.416 UTC' Min_Timestamp, TIMESTAMP '2019-12-03 23:09:13.502 UTC' Max_Timestamp, 'AAAAA' Device_ID, 'I90HYTRFJI' Session_ID UNION ALL
  SELECT '2019-12-03 23:09:21.517 UTC', '2019-12-03 23:09:53.353 UTC', 'AAAAA', '98UHIGSNJR' UNION ALL
  SELECT '2019-12-03 00:00:28.933 UTC', '2019-12-03 00:09:03.473 UTC', 'BBBBB', '32QE8Y76TG' UNION ALL
  SELECT '2019-12-03 00:09:19.106 UTC', '2019-12-03 00:23:26.554 UTC', 'BBBBB', 'R4GUY432AD' UNION ALL
  SELECT '2019-12-03 00:23:26.818 UTC', '2019-12-03 00:23:26.837 UTC', 'BBBBB', 'E32GUYE328' UNION ALL
  SELECT '2019-12-03 17:00:32.160 UTC', '2019-12-03 17:03:48.758 UTC', 'BBBBB', 'GY1EW32876' UNION ALL
  SELECT '2019-12-03 17:03:58.069 UTC', '2019-12-03 17:17:12.408 UTC', 'BBBBB', '2876T128Y7' UNION ALL
  SELECT '2019-12-03 17:18:24.528 UTC', '2019-12-03 17:18:27.516 UTC', 'BBBBB', '098U6598U5' UNION ALL
  SELECT '2019-12-03 16:30:29.970 UTC', '2019-12-03 18:44:18.972 UTC', 'CCCCC', 'UWI4UII2J4' UNION ALL
  SELECT '2019-12-04 17:32:19.285 UTC', '2019-12-04 17:32:24.668 UTC', 'CCCCC', 'G3247ROIUH' 
)
SELECT MIN(Min_Timestamp) AS Min_Timestamp, MAX(Max_Timestamp) AS Max_Timestamp, Device_ID, Session_ID
FROM (
  SELECT * EXCEPT(flag, Session_ID), 
    CONCAT(Device_ID, CAST(COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS STRING)) AS Session_ID
  FROM (
    SELECT *, 
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 flag
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, Session_ID
-- ORDER BY Device_ID, Session_ID  

with output

Row Min_Timestamp               Max_Timestamp               Device_ID   Session_ID   
1   2019-12-03 23:05:30.416 UTC 2019-12-03 23:09:53.353 UTC AAAAA       AAAAA1   
2   2019-12-03 00:00:28.933 UTC 2019-12-03 00:23:26.837 UTC BBBBB       BBBBB1   
3   2019-12-03 17:00:32.160 UTC 2019-12-03 17:17:12.408 UTC BBBBB       BBBBB2   
4   2019-12-03 17:18:24.528 UTC 2019-12-03 17:18:27.516 UTC BBBBB       BBBBB3   
5   2019-12-03 16:30:29.970 UTC 2019-12-03 18:44:18.972 UTC CCCCC       CCCCC1   
6   2019-12-04 17:32:19.285 UTC 2019-12-04 17:32:24.668 UTC CCCCC       CCCCC2     

Upvotes: 1

Related Questions