Reputation: 31950
I'm trying to figure out a query to run against Google BigQuery data, that will allow me to identify where the gaps exist, in data that only has timestamps (i.e. there's not a start and end date), data is being ordered by that timestamp, and there are non-consecutive ID numbers on rows.
Sample data:
Row localised_sample_date_time
1 2019-03-21T00:00:29
2 2019-03-21T00:01:29
3 2019-03-21T00:02:29
4 2019-03-21T00:04:29
3 2019-03-21T00:05:29
Each row of data contains a timestamp, and each row of data is expected to occur at a predictable time interval (e.g. 10 mins). I'm looking for a query that illustrates where there are data gaps i.e. when the expected '10-minutely' row isn't present.
I want the data to look something like this:
data island
-------------------------------------------------------
[start timestamp of island] - [end timestamp of island]
e.g.
data island
-------------------------------------------------------
2019-03-21T00:00:29 - 2019-03-21T00:02:29
2019-03-21T00:04:29 - 2019-03-21T00:05:29
Logically I'm looking for something like:
This is a good general resource that explains how to identify data gaps and islands - http://www.kodyaz.com/t-sql/data-islands-and-data-gaps-with-boundaries-using-sql.aspx - it doesn't seem to cover data that is structured in the way mine is though.
I'm thinking there must be something I can use that does something with DATEADD , ORDER BY and GROUP BY, to identify the groups of records that conform to the expected pattern of 10 minute intervals, however I'm not seeing how I need to implement what I'm trying to achieve. What's an elegant, simple solution?
Upvotes: 2
Views: 834
Reputation: 172993
Below is for BigQuery Standard SQL
#standardSQL
SELECT CONCAT(
FORMAT_TIMESTAMP('%F %T', MIN(IF(gap, ts, previous_ts))), ' - ', FORMAT_TIMESTAMP('%F %T', MAX(ts))
) data_island
FROM (
SELECT ts, previous_ts, gap, COUNTIF(gap) OVER(ORDER BY ts) grp
FROM (
SELECT ts, LAG(ts) OVER(ORDER BY ts) previous_ts,
IFNULL(TIMESTAMP_DIFF(ts, LAG(ts, 1) OVER(ORDER BY ts), MINUTE), 2) > 1 gap
FROM `project.dataset.table`
)
)
GROUP BY grp
You can test, play with above using sample data from your question as in example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT TIMESTAMP '2019-03-21T00:00:29' ts UNION ALL
SELECT '2019-03-21T00:01:29' UNION ALL
SELECT '2019-03-21T00:02:29' UNION ALL
SELECT '2019-03-21T00:04:29' UNION ALL
SELECT '2019-03-21T00:05:29'
)
SELECT CONCAT(
FORMAT_TIMESTAMP('%F %T', MIN(IF(gap, ts, previous_ts))), ' - ', FORMAT_TIMESTAMP('%F %T', MAX(ts))
) data_island
FROM (
SELECT ts, previous_ts, gap, COUNTIF(gap) OVER(ORDER BY ts) grp
FROM (
SELECT ts, LAG(ts) OVER(ORDER BY ts) previous_ts,
IFNULL(TIMESTAMP_DIFF(ts, LAG(ts, 1) OVER(ORDER BY ts), MINUTE), 2) > 1 gap
FROM `project.dataset.table`
)
)
GROUP BY grp
-- ORDER BY grp
with result
Row data_island
1 2019-03-21 00:00:29 - 2019-03-21 00:02:29
2 2019-03-21 00:04:29 - 2019-03-21 00:05:29
Upvotes: 2