Simon Breton
Simon Breton

Reputation: 2876

How to group by 7 days moving average's threshold with big query

Here is a sample of the data table I have :

Account     Date        Items
Account1    2018-01-01  249
Account1    2018-01-02  298
Account1    2018-01-03  297
Account1    2018-01-04  263
Account2    2018-02-05  251
Account2    2018-02-12  249
Account2    2018-02-13  298
Account2    2018-03-14  297
Account2    2018-05-23  263
Account2    2018-05-24  251
etc...

I've data for more than 100 accounts. Number of items are available everyday for each account over the last 10 years.

Using the following query I have a list of 7 days moving average value classified in the bucket I've created with the CASE function :

SELECT Account,
CASE
    WHEN max(mov_avg_7d) <1000 THEN "less than 1k"
    WHEN max(mov_avg_7d) >=1000 AND max(mov_avg_7d) <2500 THEN "1k"
    WHEN max(mov_avg_7d) >=2500 AND max(mov_avg_7d) <5000 THEN "2.5K"
    WHEN max(mov_avg_7d) >=5000 AND max(mov_avg_7d) <10000 THEN "5k"
    WHEN max(mov_avg_7d) >=10000 AND max(mov_avg_7d) <20000 THEN "10k"
    WHEN max(mov_avg_7d) >=20000 AND max(mov_avg_7d) <30000 THEN "20k"
    WHEN max(mov_avg_7d) >=30000 AND max(mov_avg_7d) <40000 THEN "30k"
    WHEN max(mov_avg_7d) >=40000 AND max(mov_avg_7d) <50000 THEN "40k"
    WHEN max(mov_avg_7d) >=50000 AND max(mov_avg_7d) <60000 THEN "50k"
    WHEN max(mov_avg_7d) >=60000 AND max(mov_avg_7d) <70000 THEN "60k"
    WHEN max(mov_avg_7d) >=70000 AND max(mov_avg_7d) <90000 THEN "70k"
    WHEN max(mov_avg_7d) >=90000 AND max(mov_avg_7d) <100000 THEN "90k"
    WHEN max(mov_avg_7d) >=100000 THEN "100k"
    ELSE "error"
END AS status
FROM (
  SELECT Account,date,Items,
  AVG(Items) OVER (PARTITION BY Account ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg_7d
  FROM `my_big_table_here` 
)
group by Account

I would like to have the max(mov_avg_7d) for each bucket and for each account. In other words I would like to build the same query but having something like group by Account,status

I'm not really confortable explaining all this. I hope my question is clear.

EDIT :

My output would be something like this :

Account     Status        Items_max
Account1    less than 1k  249
Account1    1k            1500
Account1    2.5K          2400
Account2    less than 1k  133
Account2    30k           25000
Account2    50k           49000
etc...

Upvotes: 0

Views: 636

Answers (1)

Cedric
Cedric

Reputation: 229

I'm assuming you want to get the following result:

Account      Status             Max
Account1     less than 1k       999
Account1     1k                 2499
...
Account2     less than 1k       888
Account2     1k                 2488
...

Also, I'm still new to BigQuery and SQL, so do correct me if the query doesn't work or if there are other ways of solving this problem.

My code is:

CREATE TEMP FUNCTION STATUS(mov_avg_7d FLOAT64) AS (
  CASE 
    WHEN mov_avg_7d <1000 THEN "less than 1k"
    WHEN mov_avg_7d >=1000 AND mov_avg_7d <2500 THEN "1k"
    WHEN mov_avg_7d >=2500 AND mov_avg_7d <5000 THEN "2.5K"
    WHEN mov_avg_7d >=5000 AND mov_avg_7d <10000 THEN "5k"
    WHEN mov_avg_7d >=10000 AND mov_avg_7d <20000 THEN "10k"
    WHEN mov_avg_7d >=20000 AND mov_avg_7d <30000 THEN "20k"
    WHEN mov_avg_7d >=30000 AND mov_avg_7d <40000 THEN "30k"
    WHEN mov_avg_7d >=40000 AND mov_avg_7d <50000 THEN "40k"
    WHEN mov_avg_7d >=50000 AND mov_avg_7d <60000 THEN "50k"
    WHEN mov_avg_7d >=60000 AND mov_avg_7d <70000 THEN "60k"
    WHEN mov_avg_7d >=70000 AND mov_avg_7d <90000 THEN "70k"
    WHEN mov_avg_7d >=90000 AND mov_avg_7d <100000 THEN "90k"
    WHEN mov_avg_7d >=100000 THEN "100k"
    ELSE "error"
  END);
SELECT Account, STATUS(mov_avg_7d) as status, max(mov_avg_7d) as max
FROM (
  SELECT Account,date,Items,
  AVG(Items) OVER (PARTITION BY Account ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg_7d
  FROM `my_big_table_here` 
)
group by Account, status

Hope this helps!


Explanation:

Inner subquery:

SELECT Account,date,Items,
      AVG(Items) OVER (PARTITION BY Account ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg_7d
      FROM `my_big_table_here`

gives you the following result (inner result table):

Account     Date     Items  mov_avg_7d
Account1  01012018    xx       xxx
Account1  02012018    xx       xxx
Account1  03012018    xx       xxx
... 
Account2  01012018    xx       xxx
Account2  02012018    xx       xxx
Account2  03012018    xx       xxx
...

As for the outer subquery:

SELECT Account, STATUS(mov_avg_7d) as status, max(mov_avg_7d) as max
FROM (
  <inner subquery> 
)
group by Account, status

STATUS() is a user defined function that takes the value of mov_avg_7d for each row of the inner result table and put it into the respective category.

Let's separate the SELECT statement to make things easier. Example 1:

SELECT Account, max(mov_avg_7d) as max
...
GROUP BY Account

gives the maximum 7 day rolling average for each account, regardless of the status. ie:

Account       Max
Account1      99199
Account2      82849
...

Then, example 2:

SELECT STATUS(mov_avg_7d) as status, max(mov_avg_7d) as max
...
GROUP BY status

gives the maximum 7 day rolling average for each status, regardless of the account. ie:

  Status         Max
less than 1k     899
    1k           2488
   2.5k          4500
    ...

So if you want to get the maximum 7 day rolling average for each account, each status, you simply SELECT and GROUP BY both account and status.

I hope this explanation helps, I'm not sure how else to explain to make things clearer for you as I'm still quite new to SQL too. Feel free to comment if you have any doubts and we can work on them together! :)

Upvotes: 2

Related Questions