Chris Halcrow
Chris Halcrow

Reputation: 31950

Identify data gaps in data with only timestamps

I'm trying to figure out a query to run against Google BigQuery data, that will allow me to identify where the gaps exist, in data that only has timestamps (i.e. there's not a start and end date), data is being ordered by that timestamp, and there are non-consecutive ID numbers on rows.

Sample data:

Row     localised_sample_date_time  
1       2019-03-21T00:00:29
2       2019-03-21T00:01:29
3       2019-03-21T00:02:29
4       2019-03-21T00:04:29
3       2019-03-21T00:05:29

Each row of data contains a timestamp, and each row of data is expected to occur at a predictable time interval (e.g. 10 mins). I'm looking for a query that illustrates where there are data gaps i.e. when the expected '10-minutely' row isn't present.

I want the data to look something like this:

data island
-------------------------------------------------------
[start timestamp of island] - [end timestamp of island]

e.g.

data island
-------------------------------------------------------
2019-03-21T00:00:29 - 2019-03-21T00:02:29
2019-03-21T00:04:29 - 2019-03-21T00:05:29

Logically I'm looking for something like:

This is a good general resource that explains how to identify data gaps and islands - http://www.kodyaz.com/t-sql/data-islands-and-data-gaps-with-boundaries-using-sql.aspx - it doesn't seem to cover data that is structured in the way mine is though.

I'm thinking there must be something I can use that does something with DATEADD , ORDER BY and GROUP BY, to identify the groups of records that conform to the expected pattern of 10 minute intervals, however I'm not seeing how I need to implement what I'm trying to achieve. What's an elegant, simple solution?

Upvotes: 2

Views: 834

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Below is for BigQuery Standard SQL

#standardSQL
SELECT CONCAT(
    FORMAT_TIMESTAMP('%F %T', MIN(IF(gap, ts, previous_ts))), ' - ', FORMAT_TIMESTAMP('%F %T', MAX(ts))
  ) data_island
FROM (
  SELECT ts, previous_ts, gap, COUNTIF(gap) OVER(ORDER BY ts) grp
  FROM (
    SELECT ts, LAG(ts) OVER(ORDER BY ts) previous_ts,
      IFNULL(TIMESTAMP_DIFF(ts, LAG(ts, 1) OVER(ORDER BY ts), MINUTE), 2) > 1 gap
    FROM `project.dataset.table` 
  )
)
GROUP BY grp

You can test, play with above using sample data from your question as in example below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIMESTAMP '2019-03-21T00:00:29' ts UNION ALL
  SELECT '2019-03-21T00:01:29' UNION ALL
  SELECT '2019-03-21T00:02:29' UNION ALL
  SELECT '2019-03-21T00:04:29' UNION ALL
  SELECT '2019-03-21T00:05:29' 
)
SELECT CONCAT(
    FORMAT_TIMESTAMP('%F %T', MIN(IF(gap, ts, previous_ts))), ' - ', FORMAT_TIMESTAMP('%F %T', MAX(ts))
  ) data_island
FROM (
  SELECT ts, previous_ts, gap, COUNTIF(gap) OVER(ORDER BY ts) grp
  FROM (
    SELECT ts, LAG(ts) OVER(ORDER BY ts) previous_ts,
      IFNULL(TIMESTAMP_DIFF(ts, LAG(ts, 1) OVER(ORDER BY ts), MINUTE), 2) > 1 gap
    FROM `project.dataset.table` 
  )
)
GROUP BY grp
-- ORDER BY grp

with result

Row data_island  
1   2019-03-21 00:00:29 - 2019-03-21 00:02:29    
2   2019-03-21 00:04:29 - 2019-03-21 00:05:29    

Upvotes: 2

Related Questions