Reputation: 878
I have a BigQuery table representing messages, every message belongs to a conversation and has a date:
conversation date
1 2018-06-22 23:16:46.456 UTC
2 2018-06-05 00:07:12.178 UTC
1 2018-06-22 23:16:46.456 UTC
4 2018-06-05 00:07:12.178 UTC
3 2018-06-22 23:51:28.540 UTC
3 2018-06-23 00:02:59.285 UTC
4 2018-06-04 23:21:59.500 UTC
I need to get the average time spent in a conversation.
I used this query to get it:
SELECT conversation, timestamp_diff(MAX(date), MIN(date), MINUTE) minutes
FROM `Message`
GROUP BY conversation
But since some conversations take several days, they have to be cut into smaller chunks whenever the gap between messages is larger than 1 hour. For example:
conversation date
2 2018-06-22 00:01:46.456 UTC # group 1
2 2018-06-22 00:07:12.178 UTC # group 1
2 2018-06-22 00:16:46.456 UTC # group 1
2 2018-06-22 01:07:42.178 UTC # group 1
there is a gap here
2 2018-06-22 12:51:28.540 UTC # group 2
2 2018-06-22 13:00:40.486 UTC # group 2
there is another gap here
2 2018-06-22 19:54:30.031 UTC # group 3
I think this is possible using Analytic Functions: https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
But I didn't figure out how to do it; any help would be really appreciated.
Upvotes: 4
Views: 116
Reputation: 173028
Below is for BigQuery Standard SQL.
they have to be cut into smaller chunks whenever the gap between messages is larger than 1 hour:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 2 conversation, TIMESTAMP '2018-06-22 00:01:46.456 UTC' dt UNION ALL # group 1
SELECT 2, '2018-06-22 00:07:12.178 UTC' UNION ALL # group 1
SELECT 2, '2018-06-22 00:16:46.456 UTC' UNION ALL # group 1
SELECT 2, '2018-06-22 01:07:42.178 UTC' UNION ALL # group 1
SELECT 2, '2018-06-22 12:51:28.540 UTC' UNION ALL # group 2
SELECT 2, '2018-06-22 13:00:40.486 UTC' UNION ALL # group 2
SELECT 2, '2018-06-22 19:54:30.031 UTC' # group 3
), conversation_groups AS (
SELECT
conversation, dt,
SUM(flag) OVER(PARTITION BY conversation ORDER BY dt) conversation_group # running sum of flags = group number within the conversation
FROM (
SELECT
conversation, dt,
SIGN(IFNULL(TIMESTAMP_DIFF(dt, LAG(dt) OVER(PARTITION BY conversation ORDER BY dt), HOUR), 0)) flag # 1 when the gap from the previous message reaches a full hour, else 0
FROM `project.dataset.table`
)
)
SELECT *
FROM conversation_groups
ORDER BY conversation, dt
with the result:
Row conversation dt conversation_group
1 2 2018-06-22 00:01:46.456 UTC 0
2 2 2018-06-22 00:07:12.178 UTC 0
3 2 2018-06-22 00:16:46.456 UTC 0
4 2 2018-06-22 01:07:42.178 UTC 0
5 2 2018-06-22 12:51:28.540 UTC 1
6 2 2018-06-22 13:00:40.486 UTC 1
7 2 2018-06-22 19:54:30.031 UTC 2
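Note: TIMESTAMP_DIFF with HOUR truncates to whole hours, so the flag above starts a new group as soon as the gap reaches a full hour. If you want to cut only when the gap strictly exceeds one hour, one possible variation (just a sketch, with everything else unchanged) is to replace the inner subquery with a SECOND-based comparison:
SELECT
conversation, dt,
# start a new group only when the gap from the previous message exceeds 3600 seconds
IF(IFNULL(TIMESTAMP_DIFF(dt, LAG(dt) OVER(PARTITION BY conversation ORDER BY dt), SECOND), 0) > 3600, 1, 0) flag
FROM `project.dataset.table`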
I need to get the average time spent in a conversation
#standardSQL
WITH `project.dataset.table` AS (
SELECT 2 conversation, TIMESTAMP '2018-06-22 00:01:46.456 UTC' dt UNION ALL # group 1
SELECT 2, '2018-06-22 00:07:12.178 UTC' UNION ALL # group 1
SELECT 2, '2018-06-22 00:16:46.456 UTC' UNION ALL # group 1
SELECT 2, '2018-06-22 01:07:42.178 UTC' UNION ALL # group 1
SELECT 2, '2018-06-22 12:51:28.540 UTC' UNION ALL # group 2
SELECT 2, '2018-06-22 13:00:40.486 UTC' UNION ALL # group 2
SELECT 2, '2018-06-22 19:54:30.031 UTC' # group 3
), conversation_groups AS (
SELECT
conversation, dt,
SUM(flag) OVER(PARTITION BY conversation ORDER BY dt) conversation_group
FROM (
SELECT
conversation, dt,
SIGN(IFNULL(TIMESTAMP_DIFF(dt, LAG(dt) OVER(PARTITION BY conversation ORDER BY dt), HOUR), 0)) flag
FROM `project.dataset.table`
)
)
SELECT conversation, AVG(IF(duration = 0, NULL, duration)) avg_duration # zero durations (single-message groups) become NULL and are excluded from AVG
FROM (
SELECT
conversation, conversation_group,
TIMESTAMP_DIFF(MAX(dt), MIN(dt), MINUTE) duration
FROM conversation_groups
GROUP BY conversation, conversation_group
)
GROUP BY conversation
ORDER BY conversation
with the result:
Row conversation avg_duration
1 2 37.0
Note: you can adjust the logic for calculating the average based on your particular needs. The way it is done above: first the duration of each group is calculated, then the average of those group durations is taken. If a group's duration is zero, it is replaced with NULL so it does not affect the average calculation. In the example above the group durations are 65, 9 and 0 minutes, so the average is (65 + 9) / 2 = 37. Duration is calculated in MINUTEs, but you can go with SECONDs or whatever granularity you need.
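For example, a sketch of the same final step measured in SECONDs (reusing the conversation_groups CTE from above, everything else unchanged) could look like:
SELECT conversation, AVG(IF(duration = 0, NULL, duration)) avg_duration_seconds
FROM (
SELECT
conversation, conversation_group,
# duration of each group in seconds instead of minutes
TIMESTAMP_DIFF(MAX(dt), MIN(dt), SECOND) duration
FROM conversation_groups
GROUP BY conversation, conversation_group
)
GROUP BY conversation
ORDER BY conversation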
Upvotes: 4