Reputation: 3066
I have a simple table with Time column (1 minute frequency) and another column of float value, let's call it column A. My goal is to create a query in BigQuery that does the following:
For each row, look at the window of the preceding 24 hours and the following 24 hours, and get the relevant "bucket number" (as in NTILE(5)
func) for the specific row which is the center of the window.
I'm trying to achieve a "rolling binning" - kind of similar to "rolling quantile" (as in pandas Rolling Quantile) but with the bin number. I can do it quite simply in python with pd.DataFrame.rolling.apply
and a custom function for binning and return a value, but I want to do it using BigQuery.
I'm having trouble combining both NTILE function and a rolling window.
I was able to create a rolling window (referred to as Analytic function in the docs) and perform a simple aggregation such as average by using
SELECT Time, AVG(A)
OVER (ORDER BY UNIX_SECONDS(Time) ASC RANGE BETWEEN 86400 PRECEDING AND 86400 FOLLOWING) as rolling_avg_24_hours_each_side
FROM `my_project.my_table`
(86400 comes from 3600 seconds in a hour * 24 hours. Is there an alternative to the use of magic numbers?)
using NTILE(5)
function over my entire table is quite easy, but I can't use it in my rolling window.
As the docs of Numbering functions states, using a window_frame_clause
is not allowed, therefore I can't use the RANGE...
statement combined with NTILE
.
So my main question is how to combine NTILE
function with a rolling window, or are there other solutions to achieve "rolling binning" with a rolling window?
Upvotes: 0
Views: 864
Reputation: 3066
The answer by Gordon Linoff works, and by using his rolling window trick I managed to apply NTILE
function directly instead of calculating it "manually" as suggested.
The first part remains the same, just creating the window.
Next, we can apply NTILE
function over a CROSS JOIN
between the tables:
select t.time, t.a, a2 as value, NTILE(5) OVER (PARTITION BY t.time ORDER BY a2) as bin, from t CROSS JOIN unnest(t.window_a) a2
the result is a table with some duplicate rows. For my needs I could just use Min(bin)
and DISTINCT
eliminate duplicates.
My full query:
with t as (
select t.time, t.a, array_agg(t2.a) as window_a
from `my_project.my_table` t join
`my_project.my_table` t2
on t2.time >= datetime_add(t.time, interval -24 hour) and
t2.time < datetime_add(t.time, interval 24 hour)
group by t.time, t.a
)
select distinct time, Min(bin) from (select t.time, t.a, a2 as value, NTILE(5) OVER (PARTITION BY t.time ORDER BY a2) as bin, from t CROSS JOIN unnest(t.window_a) a2)
Upvotes: 0
Reputation: 1269873
There is no simple way to do this in BigQuery as far as I know. The following method is not simple but it might work in your case. Start by bringing together the rows that you need for each window:
select t.time, t.a, array_agg(t2.a) as window_a
from `my_project.my_table` t join
`my_project.my_table` t2
on t2.time >= date_add(t.time, interval -24 hour) and
t2.time < date_add(t.time, interval 24 hour)
group by t.time, t.a;
Then calculate the the tile manually:
with t as (
select t.time, t.a, array_agg(t2.a) as window_a
from `my_project.my_table` t join
`my_project.my_table` t2
on t2.time >= datetime_add(t.time, interval -24 hour) and
t2.time < datetime_add(t.time, interval 24 hour)
group by t.time, t.a
)
select t.*,
(select ceiling(min(case when t.a = a2 then seqnum end) * 5.0 / max(cnt))
from (select a2,
row_number() over (order by a2) as seqnum,
count(*) over () as cnt
from unnest(t.window_a) a2
) w
) as rolling_tile
from t;
Upvotes: 1