Reputation: 41

How can I reference column values from previous rows in BigQuery SQL, in order to perform operations or calculations?

I have sorted my data by start time, and I want to create a new field that rolls up data that overlap start times from the previous rows start and end time.

More specifically, I want to write logic that, for a given record X, if the start time is somewhere between the start and end time of the previous row, I want to give record X the same value for the new field as that previous row. If the start time happens after the end time of the previous row, it would get a new value for the new field.

Is something like this possible in BigQuery SQL? Was thinking maybe lag or window function, but not quite sure. Below are examples of what the base table looks like and what I want for the final table.

Original Table

Desired Output Table

Any insight appreciated!

Upvotes: 4

Answers (2)

Mikhail Berlyant

Reputation: 173190

Below is for BigQuery Standard SQL

#standardSQL
SELECT recordID, startTime, endTime,
  COUNTIF(newRange) OVER(ORDER BY startTime) AS newRecordID
FROM (
  SELECT *, 
    startTime >= MAX(endTime) OVER(ORDER BY startTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS newRange
  FROM `project.dataset.table`
)

You can test, play with above using sample data from your question as in example below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 recordID, TIME '12:35:00' startTime, TIME '12:50:00' endTime UNION ALL
  SELECT 2, '12:46:00', '12:59:00' UNION ALL
  SELECT 3, '14:27:00', '16:05:00' UNION ALL
  SELECT 4, '15:48:00', '16:35:00' UNION ALL
  SELECT 5, '16:18:00', '17:04:00' 
)
SELECT recordID, startTime, endTime,
  COUNTIF(newRange) OVER(ORDER BY startTime) AS newRecordID
FROM (
  SELECT *, 
    startTime >= MAX(endTime) OVER(ORDER BY startTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS newRange
  FROM `project.dataset.table`
)
-- ORDER BY startTime

with result

Row recordID    startTime   endTime     newRecordID  
1   1           12:35:00    12:50:00    0    
2   2           12:46:00    12:59:00    0    
3   3           14:27:00    16:05:00    1    
4   4           15:48:00    16:35:00    1    
5   5           16:18:00    17:04:00    1

Upvotes: 4

Gordon Linoff

Reputation: 1271151

This is a gaps and islands problem. What you want to do is assign a group id to non-intersecting groups. You can calculating the non-intersections using window functions.

A record starts a new group if the cumulative maximum value of the end time, ordered by start time and ending at the previous record, is less than the current end time. The rest is just a cumulative sum to assign a group id.

For your data:

select t.*,
       sum(case when prev_endtime >= endtime then 0 else 1 end) over (order by starttime) as group_id
from (select t.*,
             max(endtime) over (order by starttime rows between unbounded preceding and 1 preceding) as prev_endtime
      from t
     ) t;

The only potential issue is if two records start at exactly the same time. If this can happen, the logic might need to be slightly more complex.

Upvotes: 0

How can I reference column values from previous rows in BigQuery SQL, in order to perform operations or calculations?

Answers (2)

Related Questions