kee
kee

Reputation: 11629

BigQuery: Computing the timestamp diff in time ordered rows in a group

Given a table like this, I would like to compute the time duration of each state before changing to a different state:

id state timestamp
1  1     2018-08-17 10:40:00
1  2     2018-08-17 12:40:00
1  1     2018-08-17 14:40:00
2  1     2018-08-17 09:00:00
2  2     2018-08-17 12:00:00

The output I want is:

id state date       duration
1  1     2018-08-17 2 hours
1  2     2018-08-17 2 hours
1  1     2018-08-17 9 hours 20 minutes (until the end of the day in this case)
2  1     2018-08-17 3 hours
2  2     2018-08-17 12 hours (until the end of the day in this case)

I am not so sure whether this is doable in SQL. I feel like I have to write a UDF against aggregated state and timestamp (grouped by id and ordered by ts) which outputs an array of struct (id, state, date, and duration). This array can be flattened.

Upvotes: 0

Views: 617

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Below is for BigQuery Standard SQL

#standardSQL
SELECT id, state, 
  IFNULL(
    TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, MINUTE), 
    24*60 - TIMESTAMP_DIFF(ts, TIMESTAMP_TRUNC(ts, DAY), MINUTE)
  ) AS duration_minutes
FROM `project.dataset.table`

You can test, play with above using dummy data from your question:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 1 state, TIMESTAMP('2018-08-17 10:40:00') ts UNION ALL
  SELECT 1, 2, '2018-08-17 12:40:00' UNION ALL
  SELECT 1, 1, '2018-08-17 14:40:00' UNION ALL
  SELECT 2, 1, '2018-08-17 09:00:00' UNION ALL
  SELECT 2, 2, '2018-08-17 12:00:00' 
)
SELECT id, state, 
  IFNULL(
    TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, MINUTE), 
    24*60 - TIMESTAMP_DIFF(ts, TIMESTAMP_TRUNC(ts, DAY), MINUTE)
  ) AS duration_minutes
FROM `project.dataset.table`
-- ORDER BY id, ts  

with result as below

Row id  state   duration_minutes     
1   1   1        120     
2   1   2        120     
3   1   1        560     
4   2   1        180     
5   2   2        720      

If you need your output formatted exactly the qay you showed in question - use below

#standardSQL
SELECT id, state, ts, duration_minutes,
  FORMAT('%i hours %i minutes', DIV(duration_minutes, 60), MOD(duration_minutes, 60)) duration
FROM (
  SELECT id, state, ts,
    IFNULL(
      TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, MINUTE), 
      24*60 - TIMESTAMP_DIFF(ts, TIMESTAMP_TRUNC(ts, DAY), MINUTE)
    ) AS duration_minutes
  FROM `project.dataset.table`
)

In this case you output will look like below

Row id  state   ts                        duration_minutes  duration     
1   1   1       2018-08-17 10:40:00 UTC   120               2 hours 0 minutes    
2   1   2       2018-08-17 12:40:00 UTC   120               2 hours 0 minutes    
3   1   1       2018-08-17 14:40:00 UTC   560               9 hours 20 minutes   
4   2   1       2018-08-17 09:00:00 UTC   180               3 hours 0 minutes    
5   2   2       2018-08-17 12:00:00 UTC   720               12 hours 0 minutes   

Sure, you will most likely still need to adjust above to your particular case - but you've got a good start I think

Upvotes: 3

Related Questions