Roim
Roim

Reputation: 3066

Use NTILE function over a rolling window in BigQuery (rolling binning / rolling quantile)

I have a simple table with Time column (1 minute frequency) and another column of float value, let's call it column A. My goal is to create a query in BigQuery that does the following:

For each row, look at the window of the preceding 24 hours and the following 24 hours, and get the relevant "bucket number" (as in NTILE(5) func) for the specific row which is the center of the window.

I'm trying to achieve a "rolling binning" - kind of similar to "rolling quantile" (as in pandas Rolling Quantile) but with the bin number. I can do it quite simply in python with pd.DataFrame.rolling.apply and a custom function for binning and return a value, but I want to do it using BigQuery.

I'm having trouble combining both NTILE function and a rolling window.

I was able to create a rolling window (referred to as Analytic function in the docs) and perform a simple aggregation such as average by using

SELECT Time, AVG(A)
OVER (ORDER BY UNIX_SECONDS(Time) ASC RANGE BETWEEN 86400 PRECEDING AND 86400 FOLLOWING) as rolling_avg_24_hours_each_side
FROM `my_project.my_table`

(86400 comes from 3600 seconds in a hour * 24 hours. Is there an alternative to the use of magic numbers?)

using NTILE(5) function over my entire table is quite easy, but I can't use it in my rolling window.

As the docs of Numbering functions states, using a window_frame_clause is not allowed, therefore I can't use the RANGE... statement combined with NTILE.

So my main question is how to combine NTILE function with a rolling window, or are there other solutions to achieve "rolling binning" with a rolling window?

Upvotes: 0

Views: 864

Answers (2)

Roim
Roim

Reputation: 3066

The answer by Gordon Linoff works, and by using his rolling window trick I managed to apply NTILE function directly instead of calculating it "manually" as suggested.

The first part remains the same, just creating the window. Next, we can apply NTILE function over a CROSS JOIN between the tables:

select t.time, t.a, a2 as value, NTILE(5) OVER (PARTITION BY t.time ORDER BY a2) as bin, from t CROSS JOIN unnest(t.window_a) a2

the result is a table with some duplicate rows. For my needs I could just use Min(bin) and DISTINCT eliminate duplicates.

My full query:

with t as (
      select t.time, t.a, array_agg(t2.a) as window_a
      from `my_project.my_table` t join
           `my_project.my_table` t2
           on t2.time >= datetime_add(t.time, interval -24 hour) and
              t2.time < datetime_add(t.time, interval 24 hour)
      group by t.time, t.a
     )
select distinct time, Min(bin) from (select t.time, t.a, a2 as value, NTILE(5) OVER (PARTITION BY t.time ORDER BY a2) as bin, from t CROSS JOIN unnest(t.window_a) a2)

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269873

There is no simple way to do this in BigQuery as far as I know. The following method is not simple but it might work in your case. Start by bringing together the rows that you need for each window:

select t.time, t.a, array_agg(t2.a) as window_a
from `my_project.my_table` t join
     `my_project.my_table` t2
     on t2.time >= date_add(t.time, interval -24 hour) and
        t2.time < date_add(t.time, interval 24 hour)
group by t.time, t.a;

Then calculate the the tile manually:

with t as (
      select t.time, t.a, array_agg(t2.a) as window_a
      from `my_project.my_table` t join
           `my_project.my_table` t2
           on t2.time >= datetime_add(t.time, interval -24 hour) and
              t2.time < datetime_add(t.time, interval 24 hour)
      group by t.time, t.a
     )
select t.*,
       (select ceiling(min(case when t.a = a2 then seqnum end) * 5.0 / max(cnt))
        from (select a2,
                     row_number() over (order by a2) as seqnum,
                     count(*) over () as cnt
              from unnest(t.window_a) a2
             ) w
       ) as rolling_tile
from t;

Upvotes: 1

Related Questions