Reputation: 41
I have sorted my data by start time, and I want to create a new field that rolls up data that overlap start times from the previous rows start and end time.
More specifically, I want to write logic that, for a given record X, if the start time is somewhere between the start and end time of the previous row, I want to give record X the same value for the new field as that previous row. If the start time happens after the end time of the previous row, it would get a new value for the new field.
Is something like this possible in BigQuery SQL? Was thinking maybe lag or window function, but not quite sure. Below are examples of what the base table looks like and what I want for the final table.
Any insight appreciated!
Upvotes: 4
Views: 8439
Reputation: 173190
Below is for BigQuery Standard SQL
#standardSQL
SELECT recordID, startTime, endTime,
COUNTIF(newRange) OVER(ORDER BY startTime) AS newRecordID
FROM (
SELECT *,
startTime >= MAX(endTime) OVER(ORDER BY startTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS newRange
FROM `project.dataset.table`
)
You can test, play with above using sample data from your question as in example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 recordID, TIME '12:35:00' startTime, TIME '12:50:00' endTime UNION ALL
SELECT 2, '12:46:00', '12:59:00' UNION ALL
SELECT 3, '14:27:00', '16:05:00' UNION ALL
SELECT 4, '15:48:00', '16:35:00' UNION ALL
SELECT 5, '16:18:00', '17:04:00'
)
SELECT recordID, startTime, endTime,
COUNTIF(newRange) OVER(ORDER BY startTime) AS newRecordID
FROM (
SELECT *,
startTime >= MAX(endTime) OVER(ORDER BY startTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS newRange
FROM `project.dataset.table`
)
-- ORDER BY startTime
with result
Row recordID startTime endTime newRecordID
1 1 12:35:00 12:50:00 0
2 2 12:46:00 12:59:00 0
3 3 14:27:00 16:05:00 1
4 4 15:48:00 16:35:00 1
5 5 16:18:00 17:04:00 1
Upvotes: 4
Reputation: 1271151
This is a gaps and islands problem. What you want to do is assign a group id to non-intersecting groups. You can calculating the non-intersections using window functions.
A record starts a new group if the cumulative maximum value of the end time, ordered by start time and ending at the previous record, is less than the current end time. The rest is just a cumulative sum to assign a group id.
For your data:
select t.*,
sum(case when prev_endtime >= endtime then 0 else 1 end) over (order by starttime) as group_id
from (select t.*,
max(endtime) over (order by starttime rows between unbounded preceding and 1 preceding) as prev_endtime
from t
) t;
The only potential issue is if two records start at exactly the same time. If this can happen, the logic might need to be slightly more complex.
Upvotes: 0