Reputation: 23
Can anybody tell me how to calculate a moving average in BigQuery?
Here is what I need, in MySQL style:
SELECT T1.id, T1.value_column1, AVG(T2.value_column1)
FROM table1 T1
INNER JOIN table1 T2 ON T2.Id BETWEEN T1.Id-19 AND T1.Id
GROUP BY T1.id, T1.value_column1
Upvotes: 2
Views: 4601
Reputation: 59165
For an updated and more efficient answer, see https://stackoverflow.com/a/24943950/132438.
Check the new LAG() and LEAD() window functions. They allow you to traverse the result set, without the need for a self join.
https://developers.google.com/bigquery/docs/query-reference#windowfunctions
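To see concretely what the windowed approach computes, here is a plain Python sketch of a trailing 20-row moving average, mirroring the `T2.Id BETWEEN T1.Id - 19 AND T1.Id` semantics of the MySQL query in the question (the function name and sample data are made up for illustration):

```python
# Sketch of a trailing moving average: for each row i, average the
# values of the last `window` rows up to and including i. This is what
# the self-join "T2.Id BETWEEN T1.Id - 19 AND T1.Id" (window=20) computes.
def trailing_avg(values, window=20):
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

print(trailing_avg([1, 2, 3, 4], window=3))  # [1.0, 1.5, 2.0, 3.0]
```

Note that the first rows average over fewer than `window` values, just as the SQL self join produces smaller groups near the start of the table.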
A different option is a self join with JOIN EACH (this can get too slow, as an extremely large amount of data can be generated in the intermediate steps):
SELECT a.SensorId SensorId, a.Timestamp, AVG(b.Data) AS avg_prev_hour_load
FROM (
SELECT * FROM [io_sensor_data.moscone_io13]
WHERE SensorId = 'XBee_40670EB0/mic') a
JOIN EACH [io_sensor_data.moscone_io13] b
ON a.SensorId = b.SensorId
WHERE b.Timestamp BETWEEN (a.Timestamp - 3600000) AND a.Timestamp
GROUP BY SensorId, a.Timestamp;
(based on Joe Celko's SQL problems)
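The self join above pairs each reading with all readings from the same sensor in the previous hour, then averages per timestamp. A rough Python sketch of that per-row windowing (timestamps in milliseconds to match the `a.Timestamp - 3600000` arithmetic; the sample data is invented):

```python
# Sketch of the "previous hour" average from the self join above:
# for each row's timestamp ts, average all values whose timestamp
# falls in [ts - 1 hour, ts]. Timestamps are in milliseconds.
HOUR_MS = 3600000

def prev_hour_avg(rows):
    """rows: list of (timestamp_ms, value); returns (timestamp_ms, avg) pairs."""
    out = []
    for ts, _ in rows:
        window = [v for t, v in rows if ts - HOUR_MS <= t <= ts]
        out.append((ts, sum(window) / len(window)))
    return out

readings = [(0, 2.0), (1800000, 4.0), (7200000, 6.0)]
print(prev_hour_avg(readings))  # [(0, 2.0), (1800000, 3.0), (7200000, 6.0)]
```

This is O(n^2) in Python just as the self join is quadratic in the intermediate result, which is exactly why the answer warns about large data sets.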
Upvotes: 2
Reputation: 3172
I have created the following "Times" table:
Table Details: Dim_Periods
Schema
Date TIMESTAMP
Year INTEGER
Month INTEGER
day INTEGER
QUARTER INTEGER
DAYOFWEEK INTEGER
MonthStart TIMESTAMP
MonthEnd TIMESTAMP
WeekStart TIMESTAMP
WeekEnd TIMESTAMP
Back30Days TIMESTAMP -- the date 30 days before "Date"
Back7Days TIMESTAMP -- the date 7 days before "Date"
and I use a query like the following to handle "running sums":
SELECT Date,Count(*) as MovingCNT
FROM
(SELECT Date,
Back7Days
FROM DWH.Dim_Periods
where Date < TIMESTAMP(CURRENT_DATE()) AND
Date >= DATE_ADD(CURRENT_TIMESTAMP(), -5, 'MONTH')
) P
CROSS JOIN EACH
(SELECT repository_url,repository_created_at
FROM publicdata:samples.github_timeline
) L
WHERE timestamp(repository_created_at)>= Back7Days
AND timestamp(repository_created_at)<= Date
GROUP EACH BY Date
Note that it can be used for "Month to Date", "Week to Date", "30 days back", etc. aggregations as well. However, performance is not the best, and the query can take a while on larger data sets due to the Cartesian join. Hope this helps.
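The date-dimension trick above boils down to: for each period-end date, count the events whose timestamp falls between its precomputed `Back7Days` and `Date`. A small Python sketch of that logic (the dates and events are invented sample data):

```python
from datetime import date, timedelta

# Sketch of the date-dimension approach: for each period-end date d,
# count events whose date lies in the trailing window [d - 7 days, d],
# mirroring the "Back7Days <= created_at <= Date" filter in the query.
def trailing_count(period_ends, events, days_back=7):
    counts = {}
    for d in period_ends:
        start = d - timedelta(days=days_back)
        counts[d] = sum(1 for e in events if start <= e <= d)
    return counts

events = [date(2014, 1, 1), date(2014, 1, 5), date(2014, 1, 9)]
ends = [date(2014, 1, 6), date(2014, 1, 10)]
print(trailing_count(ends, events))
```

Swapping `days_back` for 30, or for a month-start boundary, gives the "30 days back" and "Month to Date" variants the answer mentions.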
Upvotes: 0
Reputation: 26617
You can do the same thing, but since BigQuery only allows joins on equality, it takes a little more work. Here is an example that computes the 6-month moving average of birth weights from the public natality sample.
SELECT
--Convert months-since-year-0 back to year, month
INTEGER(month / 12) as year,
month % 12 as month,
avg
FROM (
SELECT month,
-- Note that this average is the average over all of the data in the
-- last 6 months, not an average over the avg values for the last 6 months.
-- It is easy to compute the latter, if that is what is desired -- just
-- compute the average in the inner select, and take the average of those
-- here.
SUM(total_weight_per_month) / SUM(records_per_month) as avg
FROM (
SELECT
-- Note we use t2.month here since that is what is compared against
-- 6 different t1 months.
t2.month as month,
t1.records_per_month as records_per_month,
t1.total_weight_per_month as total_weight_per_month
FROM (
SELECT month,
COUNT( weight_pounds ) as records_per_month,
SUM( weight_pounds) as total_weight_per_month,
-- This active field is the key that lets us join all of the
-- values against the values in the date subselect.
1 AS active
FROM (
SELECT
-- Convert year and month fields to a single value that
-- has the number of months since year 0. This will allow
-- us to do math on the dates.
year * 12 + month AS month,
weight_pounds
FROM [publicdata:samples.natality]
WHERE weight_pounds > 0)
GROUP BY month) as t1
JOIN
-- We join the weights per month against a subselect that contains
-- all months.
(SELECT month, 1 as active
FROM
(SELECT
year * 12 + month AS month
FROM [publicdata:samples.natality])
GROUP BY month) as t2
ON t1.active = t2.active
-- Here is where we get the moving average -- we basically take the month
-- value from t1 and make it apply for 6 months.
WHERE t1.month >= t2.month AND t1.month - 6 < t2.month)
GROUP BY month
ORDER BY month desc)
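The comment in the inner select makes a distinction worth seeing concretely: summing totals and counts over the window gives the average over all records, which is generally not the same as averaging the monthly averages. A small Python sketch with invented per-month figures:

```python
# Invented per-month (record_count, total_weight) pairs:
# month A averages 7.0 per record, month B averages 9.0 per record.
months = [(10, 70.0), (2, 18.0)]

# SUM(total_weight_per_month) / SUM(records_per_month), as in the query:
# the average over all records in the window.
overall = sum(t for _, t in months) / sum(c for c, _ in months)

# The average of the monthly averages: weights every month equally,
# regardless of how many records it has.
of_avgs = sum(t / c for c, t in months) / len(months)

print(overall)  # 88 / 12, about 7.33
print(of_avgs)  # (7.0 + 9.0) / 2 = 8.0
```

The query computes the first form, which weights busy months more heavily; the second form would be computed by averaging in the inner select instead, as the comment notes.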
Upvotes: 4