Reputation: 386
A simplified version of my problem: I have a table with the following fields: id, timestamp, and a numeric variable (speed). I need to identify time periods (start and end timestamps) where the average speed is less than a threshold (say 2), but where the period (end timestamp - start timestamp) lasts at least a minimum duration (say 5 hours or more). Essentially, I need to calculate the mean for an initial 5-hour window and, if the mean is less than the threshold, retain the start timestamp, step the end timestamp forward one row, and recalculate the average. If the new average is still less than the threshold, step forward again, expanding the time window. If the new average is greater than the threshold, report the previous end timestamp as the end of this window, initiate a new start timestamp, and calculate a new average looking forward another 5 hours. Ultimately, the end product is a table with a set of start timestamps, end timestamps (and a calculated duration) between which the average speed was less than 2, where the time between the start and the end is at least 5 hours.
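To make the stepping logic concrete, here is a rough sketch of what I mean in plain JavaScript (the rows array, field names, and units are illustrative only; timestamps are in seconds):

```javascript
// rows: array of {ts, speed} sorted by ts (ts in seconds).
// Returns windows of at least minLenSec whose average speed stays below threshold.
function findSlowPeriods(rows, minLenSec, threshold) {
  var result = [];
  var start = 0;
  while (start < rows.length) {
    // Find the first end index that makes the window at least minLenSec long.
    var end = start;
    while (end < rows.length && rows[end].ts - rows[start].ts < minLenSec) end++;
    if (end === rows.length) break; // not enough data left for a full window

    // Average over the initial window.
    var sum = 0, n = 0;
    for (var k = start; k <= end; k++) { sum += rows[k].speed; n++; }
    if (sum / n >= threshold) { start++; continue; } // initial window fails: slide start

    // Grow the window while the running average stays below the threshold.
    while (end + 1 < rows.length && (sum + rows[end + 1].speed) / (n + 1) < threshold) {
      end++; sum += rows[end].speed; n++;
    }
    result.push({
      start_ts: rows[start].ts,
      end_ts: rows[end].ts,
      avg: sum / n,
      hours: (rows[end].ts - rows[start].ts) / 3600
    });
    start = end + 1; // resume the scan after the reported window
  }
  return result;
}

// Hourly sample: six slow readings, then faster ones.
var rows = [1, 1, 1, 1, 1, 1, 5, 5, 5, 5].map(function (s, i) {
  return { ts: i * 3600, speed: s };
});
// Expands past 5 hours while the average stays below 2, then reports one window.
console.log(findSlowPeriods(rows, 5 * 3600, 2));
```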
I am using Google BigQuery. Here is the general structure I have so far, but it doesn't seem to be working how I would like. First, it only tests and reports the speed threshold for the initial 5-hour window, even if the window grows. Second, it doesn't seem to grow the window properly: rarely is the window much longer than 5 hours, even though, looking at my data, in some cases it should be twice as long. I'm hoping that someone has tried to develop a similar analysis and can shed light on where mine is going wrong.
SELECT
  *,
  LEAD(start_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS next_start_timestamp,
  LEAD(end_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS next_end_timestamp
FROM (
  SELECT
    *,
    IF(last_timestamp IS NULL
      OR timestamp - last_timestamp > 1000000*60*60*5, TRUE, FALSE) AS start_timestamp, # 1000000*60*60*5 = 5 hours in microseconds
    IF(next_timestamp IS NULL
      OR next_timestamp - timestamp > 1000000*60*60*5, TRUE, FALSE) AS end_timestamp    # 1000000*60*60*5 = 5 hours in microseconds
  FROM (
    SELECT
      *,
      LAG(timestamp, 1) OVER (PARTITION BY id ORDER BY timestamp) last_timestamp,
      LEAD(timestamp, 1) OVER (PARTITION BY id ORDER BY timestamp) next_timestamp,
    FROM (
      SELECT
        *,
        AVG(speed) OVER (PARTITION BY id ORDER BY timestamp RANGE BETWEEN 5 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) AS avg_speed_last_period,
      FROM (
        SELECT
          id,
          timestamp,
          speed
        FROM
          [dataset.table1]))
    WHERE
      avg_speed_last_period < 2
    ORDER BY
      id,
      timestamp)
  HAVING
    start_timestamp
    OR end_timestamp)
EDIT: Here is a link to some sample_data. Given this data and the requirement of an average speed less than 2 for at least 5 hours, the first rows of the output table would hopefully be:
ID start_event end_event average_speed duration_hrs
203 2015-01-08 17:40:06 UTC 2015-01-09 07:09:35 UTC 0.7802 13.491
203 2015-01-10 03:43:56 UTC 2015-01-10 08:48:57 UTC 1.452 5.083
Upvotes: 3
Views: 1072
Reputation: 172993
From your CSV, I am assuming the schema below,
with the following data in it.
With this in mind, below is working code for BigQuery Standard SQL.
It does exactly what you expect, producing the output:
id start_event end_event average_speed duration_hrs
203 2015-01-08 17:40:00 UTC 2015-01-09 07:09:00 UTC 0.78 13.48
203 2015-01-10 03:43:00 UTC 2015-01-10 08:48:00 UTC 1.45 5.08
#standardSQL
CREATE TEMPORARY FUNCTION IdentifyTimeRanges(
items ARRAY<STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>>,
min_length INT64, threshold FLOAT64, max_speed FLOAT64
)
RETURNS ARRAY<STRUCT<start_event TIMESTAMP, end_event TIMESTAMP, average_speed FLOAT64, duration_hrs FLOAT64>>
LANGUAGE js AS """
var result = [];
var initial = 0;                  // index of the current window's first point
var candidate = items[initial].ts;
var len = 0;                      // number of points in the current window
var sum = 0;                      // sum of speeds in the current window
for (var i = 0; i < items.length; i++) {
  len++;
  sum += items[i].speed;
  if (items[i].ts - candidate < min_length) {
    // Window is still shorter than min_length; a speed spike restarts it.
    if (items[i].speed > max_speed) {
      initial = i + 1;
      if (initial >= items.length) break; // no points left to start a new window
      candidate = items[initial].ts;
      len = 0;
      sum = 0;
    }
    continue;
  }
  if (sum / len > threshold || items[i].speed > max_speed) {
    // Current point breaks the window; report the window ending at the previous point.
    var avg_speed = (sum - items[i].speed) / (len - 1);
    if (avg_speed <= threshold && items[i - 1].ts - items[initial].ts >= min_length) {
      var o = {};
      o.start_event = items[initial].datetime;
      o.end_event = items[i - 1].datetime;
      o.average_speed = parseFloat(avg_speed.toFixed(3));
      o.duration_hrs = parseFloat(((items[i - 1].ts - items[initial].ts) / 60 / 60).toFixed(3));
      result.push(o);
    }
    // Start a new window at the breaking point.
    initial = i;
    candidate = items[initial].ts;
    len = 1;
    sum = items[initial].speed;
  }
}
return result;
""";
WITH data AS (
SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed
FROM `yourTable`
), compact_data AS (
SELECT id, ARRAY_AGG(STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>(UNIX_SECONDS(datetime), speed, datetime) ORDER BY UNIX_SECONDS(datetime)) AS points
FROM data
GROUP BY id
)
SELECT
id, start_event, end_event, average_speed, duration_hrs
FROM compact_data, UNNEST(IdentifyTimeRanges(points, 5*60*60, 2, 3.1)) AS segment
ORDER BY id, start_event
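To sanity-check the scan outside BigQuery, the same logic as the `IdentifyTimeRanges` UDF can be reproduced in plain Node.js. This is a local re-implementation for debugging only (the sample points below are made up, and a small out-of-bounds guard is added for the case where the last point in an unfinished window exceeds `max_speed`):

```javascript
// Local re-implementation of the UDF's scan for debugging outside BigQuery.
// items: array of {ts (seconds), speed, datetime} sorted by ts.
function identifyTimeRanges(items, minLength, threshold, maxSpeed) {
  var result = [];
  var initial = 0;
  var candidate = items[initial].ts;
  var len = 0;
  var sum = 0;
  for (var i = 0; i < items.length; i++) {
    len++;
    sum += items[i].speed;
    if (items[i].ts - candidate < minLength) {
      // Window not yet long enough; a speed spike restarts it.
      if (items[i].speed > maxSpeed) {
        initial = i + 1;
        if (initial >= items.length) break; // guard: no points left
        candidate = items[initial].ts;
        len = 0;
        sum = 0;
      }
      continue;
    }
    if (sum / len > threshold || items[i].speed > maxSpeed) {
      // Report the window ending at the previous point, if it qualifies.
      var avgSpeed = (sum - items[i].speed) / (len - 1);
      if (avgSpeed <= threshold && items[i - 1].ts - items[initial].ts >= minLength) {
        result.push({
          start_event: items[initial].datetime,
          end_event: items[i - 1].datetime,
          average_speed: parseFloat(avgSpeed.toFixed(3)),
          duration_hrs: parseFloat(((items[i - 1].ts - items[initial].ts) / 3600).toFixed(3))
        });
      }
      initial = i;
      candidate = items[initial].ts;
      len = 1;
      sum = items[initial].speed;
    }
  }
  return result;
}

// Hourly sample: six slow readings followed by a spike that closes the window.
var sample = [1, 1, 1, 1, 1, 1, 9].map(function (speed, i) {
  return { ts: i * 3600, speed: speed, datetime: new Date(i * 3600 * 1000).toISOString() };
});
// Expect one range: 5 hours of average speed 1, closed by the spike.
console.log(identifyTimeRanges(sample, 5 * 3600, 2, 3.1));
```

Note that, like the UDF, this scan only emits a range once a later point breaks it, so a qualifying window that runs to the very end of the data is not reported.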
Please note: this code uses User-Defined Functions,
which come with some limits
and quotas,
and can carry a cost hit
depending on the size of your data.
Also keep in mind: if the data type of your datetime field is not STRING, all you need to do is slightly adjust the data subquery
- the rest remains as is!
For example, if datetime is of TIMESTAMP data type, you just need to replace
SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed
FROM `yourTable`
with
SELECT id, datetime, speed
FROM `yourTable`
Hope you enjoy it :o)
Upvotes: 1