Reputation: 386
A simplified version of my problem: I have a table with the following fields: id, timestamp, and a numeric variable (speed). I need to identify time periods (start and end timestamps) where the average speed is less than a threshold (say 2), but where the period (end timestamp - start timestamp) lasts at least a minimum duration (say 5 hours or more). Essentially, I need to calculate the mean for an initial 5-hour window and, if the mean is less than the threshold, retain the start timestamp, step the end timestamp forward one row, and recalculate the average. If the new average is still less than the threshold, step forward again, expanding the time window. If the new average is greater than the threshold, report the previous end timestamp as the end of this window, initiate a new start timestamp, and calculate a new average looking forward another 5 hours. Ultimately, the end product is a table with a set of start timestamps, end timestamps (and a calculated duration) between which the average speed was less than 2, where the time between the start and the end is at least 5 hours.
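To make the stepping logic concrete, here is a rough sketch of what I mean in plain JavaScript (the rows array, field names, and units are illustrative only; timestamps are in seconds):

```javascript
// rows: array of {ts, speed} sorted by ts (ts in seconds).
// Returns windows of at least minLenSec whose average speed stays below threshold.
function findSlowPeriods(rows, minLenSec, threshold) {
  var result = [];
  var start = 0;
  while (start < rows.length) {
    // Find the first end index that makes the window at least minLenSec long.
    var end = start;
    while (end < rows.length && rows[end].ts - rows[start].ts < minLenSec) end++;
    if (end === rows.length) break; // not enough data left for a full window

    // Average over the initial window.
    var sum = 0, n = 0;
    for (var k = start; k <= end; k++) { sum += rows[k].speed; n++; }
    if (sum / n >= threshold) { start++; continue; } // initial window fails: slide start

    // Grow the window while the running average stays below the threshold.
    while (end + 1 < rows.length && (sum + rows[end + 1].speed) / (n + 1) < threshold) {
      end++; sum += rows[end].speed; n++;
    }
    result.push({
      start_ts: rows[start].ts,
      end_ts: rows[end].ts,
      avg: sum / n,
      hours: (rows[end].ts - rows[start].ts) / 3600
    });
    start = end + 1; // resume the scan after the reported window
  }
  return result;
}

// Hourly sample: six slow readings, then faster ones.
var rows = [1, 1, 1, 1, 1, 1, 5, 5, 5, 5].map(function (s, i) {
  return { ts: i * 3600, speed: s };
});
// Expands past 5 hours while the average stays below 2, then reports one window.
console.log(findSlowPeriods(rows, 5 * 3600, 2));
```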
I am using Google BigQuery. Here is the general structure I have so far, but it doesn't seem to be working how I would like. First, it only tests and reports the speed threshold for the initial 5-hour window, even if the window grows. Second, it doesn't seem to grow the window properly: rarely is the window much longer than 5 hours, even though, looking at my data, in some cases it should be twice as long. I'm hoping that someone has tried to develop a similar analysis and can shed light on where mine is going wrong.
SELECT
  *,
  LEAD(start_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS next_start_timestamp,
  LEAD(end_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS next_end_timestamp
FROM (
  SELECT
    *,
    IF(last_timestamp IS NULL
      OR timestamp - last_timestamp > 1000000*60*60*5, TRUE, FALSE) AS start_timestamp, # 1000000*60*60*5 = 5 hours in microseconds
    IF(next_timestamp IS NULL
      OR next_timestamp - timestamp > 1000000*60*60*5, TRUE, FALSE) AS end_timestamp    # 1000000*60*60*5 = 5 hours in microseconds
  FROM (
    SELECT
      *,
      LAG(timestamp, 1) OVER (PARTITION BY id ORDER BY timestamp) last_timestamp,
      LEAD(timestamp, 1) OVER (PARTITION BY id ORDER BY timestamp) next_timestamp,
    FROM (
      SELECT
        *,
        AVG(speed) OVER (PARTITION BY id ORDER BY timestamp RANGE BETWEEN 5 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) AS avg_speed_last_period,
      FROM (
        SELECT
          id,
          timestamp,
          speed
        FROM
          [dataset.table1]))
    WHERE
      avg_speed_last_period < 2
    ORDER BY
      id,
      timestamp)
  HAVING
    start_timestamp
    OR end_timestamp)
EDIT: Here is a link to some sample_data. Given this data and the requirement of an average speed less than 2 for at least 5 hours, the first rows of the output table would hopefully be:
ID start_event end_event average_speed duration_hrs
203 2015-01-08 17:40:06 UTC 2015-01-09 07:09:35 UTC 0.7802 13.491
203 2015-01-10 03:43:56 UTC 2015-01-10 08:48:57 UTC 1.452 5.083
Upvotes: 3
Views: 1072
Reputation: 172993
From your CSV, I am assuming the schema below,
with the following data in it.
With this in mind, below is working code for BigQuery Standard SQL.
It does exactly what you expect, producing the output:
id start_event end_event average_speed duration_hrs
203 2015-01-08 17:40:00 UTC 2015-01-09 07:09:00 UTC 0.78 13.48
203 2015-01-10 03:43:00 UTC 2015-01-10 08:48:00 UTC 1.45 5.08
#standardSQL
CREATE TEMPORARY FUNCTION IdentifyTimeRanges(
items ARRAY<STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>>,
min_length INT64, threshold FLOAT64, max_speed FLOAT64
)
RETURNS ARRAY<STRUCT<start_event TIMESTAMP, end_event TIMESTAMP, average_speed FLOAT64, duration_hrs FLOAT64>>
LANGUAGE js AS """
var result = [];
var initial = 0;                  // index of the current window's first point
var candidate = items[initial].ts;
var len = 0;                      // number of points in the current window
var sum = 0;                      // sum of speeds in the current window
for (var i = 0; i < items.length; i++) {
  len++;
  sum += items[i].speed;
  if (items[i].ts - candidate < min_length) {
    // Window is still shorter than min_length; a speed spike restarts it.
    if (items[i].speed > max_speed) {
      initial = i + 1;
      if (initial >= items.length) break; // no points left to start a new window
      candidate = items[initial].ts;
      len = 0;
      sum = 0;
    }
    continue;
  }
  if (sum / len > threshold || items[i].speed > max_speed) {
    // Current point breaks the window; report the window ending at the previous point.
    var avg_speed = (sum - items[i].speed) / (len - 1);
    if (avg_speed <= threshold && items[i - 1].ts - items[initial].ts >= min_length) {
      var o = {};
      o.start_event = items[initial].datetime;
      o.end_event = items[i - 1].datetime;
      o.average_speed = parseFloat(avg_speed.toFixed(3));
      o.duration_hrs = parseFloat(((items[i - 1].ts - items[initial].ts) / 60 / 60).toFixed(3));
      result.push(o);
    }
    // Start a new window at the breaking point.
    initial = i;
    candidate = items[initial].ts;
    len = 1;
    sum = items[initial].speed;
  }
}
return result;
""";
WITH data AS (
SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed
FROM `yourTable`
), compact_data AS (
SELECT id, ARRAY_AGG(STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>(UNIX_SECONDS(datetime), speed, datetime) ORDER BY UNIX_SECONDS(datetime)) AS points
FROM data
GROUP BY id
)
SELECT
id, start_event, end_event, average_speed, duration_hrs
FROM compact_data, UNNEST(IdentifyTimeRanges(points, 5*60*60, 2, 3.1)) AS segment
ORDER BY id, start_event
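To sanity-check the scan outside BigQuery, the same logic as the `IdentifyTimeRanges` UDF can be reproduced in plain Node.js. This is a local re-implementation for debugging only (the sample points below are made up, and a small out-of-bounds guard is added for the case where the last point in an unfinished window exceeds `max_speed`):

```javascript
// Local re-implementation of the UDF's scan for debugging outside BigQuery.
// items: array of {ts (seconds), speed, datetime} sorted by ts.
function identifyTimeRanges(items, minLength, threshold, maxSpeed) {
  var result = [];
  var initial = 0;
  var candidate = items[initial].ts;
  var len = 0;
  var sum = 0;
  for (var i = 0; i < items.length; i++) {
    len++;
    sum += items[i].speed;
    if (items[i].ts - candidate < minLength) {
      // Window not yet long enough; a speed spike restarts it.
      if (items[i].speed > maxSpeed) {
        initial = i + 1;
        if (initial >= items.length) break; // guard: no points left
        candidate = items[initial].ts;
        len = 0;
        sum = 0;
      }
      continue;
    }
    if (sum / len > threshold || items[i].speed > maxSpeed) {
      // Report the window ending at the previous point, if it qualifies.
      var avgSpeed = (sum - items[i].speed) / (len - 1);
      if (avgSpeed <= threshold && items[i - 1].ts - items[initial].ts >= minLength) {
        result.push({
          start_event: items[initial].datetime,
          end_event: items[i - 1].datetime,
          average_speed: parseFloat(avgSpeed.toFixed(3)),
          duration_hrs: parseFloat(((items[i - 1].ts - items[initial].ts) / 3600).toFixed(3))
        });
      }
      initial = i;
      candidate = items[initial].ts;
      len = 1;
      sum = items[initial].speed;
    }
  }
  return result;
}

// Hourly sample: six slow readings followed by a spike that closes the window.
var sample = [1, 1, 1, 1, 1, 1, 9].map(function (speed, i) {
  return { ts: i * 3600, speed: speed, datetime: new Date(i * 3600 * 1000).toISOString() };
});
// Expect one range: 5 hours of average speed 1, closed by the spike.
console.log(identifyTimeRanges(sample, 5 * 3600, 2, 3.1));
```

Note that, like the UDF, this scan only emits a range once a later point breaks it, so a qualifying window that runs to the very end of the data is not reported.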
Please note: this code uses User-Defined Functions,
which come with some limits
and quotas,
and can carry a cost hit
depending on the size of your data.
Also keep in mind: if the data type of your datetime field is not STRING, all you need to do is slightly adjust the data subquery
- the rest remains as is!
For example, if datetime is of TIMESTAMP data type, you just need to replace
SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed
FROM `yourTable`
with
SELECT id, datetime, speed
FROM `yourTable`
Hope you enjoy it :o)
Upvotes: 1