Sofiane Daoud
Sofiane Daoud

Reputation: 878

BigQuery average time with gap constraint

I have a BigQuery table representing messages, every message belongs to a conversation and has a date:

conversation    date    
1               2018-06-22 23:16:46.456 UTC
2               2018-06-05 00:07:12.178 UTC
1               2018-06-22 23:16:46.456 UTC
4               2018-06-05 00:07:12.178 UTC
3               2018-06-22 23:51:28.540 UTC
3               2018-06-23 00:02:59.285 UTC
4               2018-06-04 23:21:59.500 UTC

I need to get the average time spent in a conversation

I used this query to get it:

SELECT conversation, timestamp_diff(MAX(date), MIN(date), MINUTE) minutes
FROM `Message`
GROUP BY conversation

But since some conversations take several days, they have to be cut into smaller chunks when a the gap between messages is larger than 1 hour example:

conversation    date    
2               2018-06-22 00:01:46.456 UTC   # group 1
2               2018-06-22 00:07:12.178 UTC   # group 1
2               2018-06-22 00:16:46.456 UTC   # group 1
2               2018-06-22 01:07:42.178 UTC   # group 1
      there is a gap here
2               2018-06-22 12:51:28.540 UTC   # group 2
2               2018-06-22 13:00:40.486 UTC   # group 2
      there is another gap here
2               2018-06-22 19:54:30.031 UTC   # group 3

I think this is possible using Analytic Functions: https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts

But I didn't figure out how to do it, any help would be really appriciated.

Upvotes: 4

Views: 116

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173028

Below for BigQuery Standard SQL

they have to be cut into smaller chunks when a the gap between messages is larger than 1 hour:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 2 conversation, TIMESTAMP '2018-06-22 00:01:46.456 UTC' dt UNION ALL   # group 1
  SELECT 2, '2018-06-22 00:07:12.178 UTC' UNION ALL   # group 1
  SELECT 2, '2018-06-22 00:16:46.456 UTC' UNION ALL   # group 1
  SELECT 2, '2018-06-22 01:07:42.178 UTC' UNION ALL   # group 1
  SELECT 2, '2018-06-22 12:51:28.540 UTC' UNION ALL   # group 2
  SELECT 2, '2018-06-22 13:00:40.486 UTC' UNION ALL   # group 2
  SELECT 2, '2018-06-22 19:54:30.031 UTC'             # group 3
), conversation_groups AS (
  SELECT 
    conversation, dt, 
    SUM(flag) OVER(PARTITION BY conversation ORDER BY dt) conversation_group
  FROM (
    SELECT 
      conversation, dt, 
      SIGN(IFNULL(TIMESTAMP_DIFF(dt, LAG(dt) OVER(PARTITION BY conversation ORDER BY dt), HOUR), 0)) flag
    FROM `project.dataset.table`
  )
)
SELECT *
FROM conversation_groups
ORDER BY conversation, dt   

with result as

Row conversation    dt                              conversation_group   
1   2               2018-06-22 00:01:46.456 UTC     0    
2   2               2018-06-22 00:07:12.178 UTC     0    
3   2               2018-06-22 00:16:46.456 UTC     0    
4   2               2018-06-22 01:07:42.178 UTC     0    
5   2               2018-06-22 12:51:28.540 UTC     1    
6   2               2018-06-22 13:00:40.486 UTC     1    
7   2               2018-06-22 19:54:30.031 UTC     2      

I need to get the average time spent in a conversation

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 2 conversation, TIMESTAMP '2018-06-22 00:01:46.456 UTC' dt UNION ALL   # group 1
  SELECT 2, '2018-06-22 00:07:12.178 UTC' UNION ALL   # group 1
  SELECT 2, '2018-06-22 00:16:46.456 UTC' UNION ALL   # group 1
  SELECT 2, '2018-06-22 01:07:42.178 UTC' UNION ALL   # group 1
  SELECT 2, '2018-06-22 12:51:28.540 UTC' UNION ALL   # group 2
  SELECT 2, '2018-06-22 13:00:40.486 UTC' UNION ALL   # group 2
  SELECT 2, '2018-06-22 19:54:30.031 UTC'             # group 3
), conversation_groups AS (
  SELECT 
    conversation, dt, 
    SUM(flag) OVER(PARTITION BY conversation ORDER BY dt) conversation_group
  FROM (
    SELECT 
      conversation, dt, 
      SIGN(IFNULL(TIMESTAMP_DIFF(dt, LAG(dt) OVER(PARTITION BY conversation ORDER BY dt), HOUR), 0)) flag
    FROM `project.dataset.table`
  )
)
SELECT conversation, AVG(IF(duration = 0, NULL, duration)) avg_duration
FROM (
  SELECT
    conversation, conversation_group,
    TIMESTAMP_DIFF(MAX(dt), MIN(dt), MINUTE) duration
  FROM conversation_groups
  GROUP BY conversation, conversation_group
)
GROUP BY conversation
ORDER BY conversation

with result as

Row conversation    avg_duration     
1   2               37.0     

Note: you can adjust logic of calculating avg based on your particular needs/vision of how to do so - but the way it is done above is - first duration of each group is calculated and then average of duration in these groups are taken - and note: if there is a duration of zero it is replaced with NULL so it does not affect average calculation. Duration is calculated in MINUTEs, but you can go with SECONDs or whatever you need

Upvotes: 4

Related Questions