Canovice
Canovice

Reputation: 10163

BigQuery filter WHERE by date for last 5 rows for each value of categorical column

Apologies if the title is a bit wordy - i will create an example below to highlight what i'm referring to. I have the following table of information:

t1

date        team    num_val
2017-10-04    ab          7  
2017-10-03    ab          6
2017-10-02    ab          8
2017-10-05    ab          3
2017-10-07    ab         12
2017-10-06    ab          3
2017-10-01    ab          5
2017-09-08    cd          4
2017-09-09    cd          8
2017-09-10    cd          2
2017-09-14    cd          1
2017-09-13    cd          5
2017-09-11    cd          6
2017-09-12    cd         13

With this table, I would simply like to:

Simple enough. However, there is no rhyme or reason to the dates for each team (I cannot simply filter on the most recent 5 dates overall, since they may be different for each team). I currently have the following framework for the query:

SELECT
  team, 
  sum(num_val)
FROM t1
GROUP BY team

... any help getting this to the finish line would be greatly appreciated, thanks!!

Upvotes: 0

Views: 1135

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

Few more options for BigQuery Standard SQL, so you see different approaches

Option #1

#standardSQL
SELECT team, SUM(num_val) sum_num FROM (
  SELECT team, num_val, ROW_NUMBER() OVER(PARTITION BY team ORDER BY DATE DESC) pos
  FROM `project.dataset.table`
)
WHERE pos <= 5
GROUP BY team

Option #2

#standardSQL
SELECT team, sum_num FROM (
  SELECT team, 
    SUM(num_val) OVER(PARTITION BY team ORDER BY DATE DESC ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING) AS sum_num, 
    ROW_NUMBER() OVER(PARTITION BY team ORDER BY DATE DESC) pos
  FROM `project.dataset.table`
)
WHERE pos = 1  

If to apply to sample data from your question - both produce below result

Row team    sum_num  
1   ab      31   
2   cd      27     

While above options can be useful in some more complicated cases - in your particular case - I would go with option (similar to one) presented in Filipe's answer

#standardSQL
SELECT team, (SELECT SUM(num_val) FROM UNNEST(num_values)) sum_num
FROM (
  SELECT team, ARRAY_AGG(STRUCT(num_val) ORDER BY DATE DESC LIMIT 5) num_values
  FROM `project.dataset.table`
  GROUP BY team
)

Upvotes: 1

Felipe Hoffa
Felipe Hoffa

Reputation: 59165

To get the latest 5 for each:

SELECT team, ARRAY_AGG(num_val ORDER BY date DESC LIMIT 5) arr
FROM x
GROUP BY team

Then UNNEST(arr) and add those num_vals.

SELECT team, (SELECT SUM(num_val) FROM UNNEST(arr) num_val) the_sum
FROM (previous)

Upvotes: 1

Related Questions