smaica

Reputation: 817

SQL window function ORDER BY 2 columns, sum over all previous rows in Google BigQuery

In Google BigQuery, I want to compute SUM(value) over all rows where condition1 is smaller than in the current row AND condition2 is smaller than in the current row. Note that the dataset is ordered by condition1, so every previous row has a condition1 value no larger than the current row's. The dataset is also very large, so a self-join is not feasible in terms of computation time.

Here is an example:

-- | condition1 | condition2 | value | expected_output |
-- |------------|------------|-------|-----------------|
-- |     1      |      2     |    1  |        0        |
-- |     1      |      2     |    1  |        0        |
-- |     2      |      1     |    1  |        0        |
-- |     2      |      2     |    1  |        0        | 
-- |     2      |      3     |    1  |        2        | --> sum over rows 1 and 2
-- |     3      |      2     |    1  |        1        | --> sum over row 3
-- |     3      |      2     |    1  |        1        | --> sum over row 3
-- |     4      |      5     |    1  |        7        | --> sum over rows 1-7
-- |     5      |      4     |    1  |        7        | --> sum over rows 1-7
-- |     6      |      7     |    1  |        9        | --> sum over rows 1-9

Note that in this example value is always 1 to keep things simple, but value can be any number.
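
As a cross-check of the expected_output column, here is a minimal Python sketch (not BigQuery SQL) of the intended semantics: for each row, add up value over every row whose condition1 and condition2 are both strictly smaller. The function name dominance_sum and the tuple layout are illustrative assumptions, and the quadratic loop is only meant for small samples like this one.

```python
def dominance_sum(rows):
    """For each (condition1, condition2, value) row, sum value over all rows
    whose condition1 AND condition2 are both strictly smaller."""
    return [
        sum(v for (a, b, v) in rows if a < c1 and b < c2)
        for (c1, c2, _) in rows
    ]


# The ten rows of the example table, in the same order.
rows = [
    (1, 2, 1), (1, 2, 1), (2, 1, 1), (2, 2, 1), (2, 3, 1),
    (3, 2, 1), (3, 2, 1), (4, 5, 1), (5, 4, 1), (6, 7, 1),
]

print(dominance_sum(rows))  # -> [0, 0, 0, 0, 2, 1, 1, 7, 7, 9]
```

The printed list matches the expected_output column row by row.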

I tried with:

SUM(VALUE) OVER (ORDER BY condition1, condition2 RANGE UNBOUNDED PRECEDING)

but the problem is that only the first condition, condition1, is effectively "checked". The rows are also ordered by condition2, but every earlier row is still included in the sum, even if its condition2 value is larger than the value in the current row.
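
To make the mismatch concrete, here is a small Python sketch that emulates what that window computes, assuming the usual semantics of RANGE UNBOUNDED PRECEDING (the frame runs up to the current row and includes peers with an identical ordering key). The helper name lexicographic_running_sum is made up for illustration.

```python
def lexicographic_running_sum(rows):
    """Emulate SUM(value) OVER (ORDER BY condition1, condition2
    RANGE UNBOUNDED PRECEDING): sum every row whose (condition1, condition2)
    pair sorts at or before the current pair, peers included."""
    return [
        sum(v for (a, b, v) in rows if (a, b) <= (c1, c2))
        for (c1, c2, _) in rows
    ]


# The ten rows of the example table, in the same order.
rows = [
    (1, 2, 1), (1, 2, 1), (2, 1, 1), (2, 2, 1), (2, 3, 1),
    (3, 2, 1), (3, 2, 1), (4, 5, 1), (5, 4, 1), (6, 7, 1),
]

print(lexicographic_running_sum(rows))  # -> [2, 2, 3, 4, 5, 7, 7, 8, 9, 10]
```

For the row (5, 4) the emulated window returns 9 rather than the wanted 7: it counts the row itself and also the row (4, 5), whose condition2 is larger, because the two columns are compared as one lexicographic key rather than independently.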

Here is a query to experiment with, in which I already tested the behavior of different windows:

WITH data AS (
  SELECT * 
  FROM UNNEST([
    STRUCT
    (1 as condition1, 2 as condition2,1 as value), 
    (2,1,1),
    (3,2,1),
    (2,2,1),
    (5,4,1),
    (4,5,1),
    (2,3,1),
    (6,7,1),
    (3,2,1)
  ])
)
SELECT
  *
  ,SUM(VALUE) OVER (ORDER BY condition1, condition2 RANGE UNBOUNDED PRECEDING) as test_1
  ,SUM(VALUE) OVER (ORDER BY condition2, condition1 RANGE UNBOUNDED PRECEDING) as test_2
  ,RANK() OVER (ORDER BY condition1, condition2) as rank1
  ,RANK() OVER (ORDER BY condition2, condition1) as rank2
FROM data
ORDER BY condition1

Thanks!

Update: Thanks to @Mikhail Berlyant's input, I noticed that my question was not phrased correctly. I have now updated the question and the example.

Upvotes: 1

Views: 1081

Answers (2)

Mikhail Berlyant

Reputation: 172993

Try the approach below. I know it looks a little crazy, but give it a try: the idea is to partition the calculation by condition2 and then join the result back to the original data.

WITH temp1 AS (
  SELECT *, ROW_NUMBER() OVER(ORDER BY condition1, condition2) AS pos
  FROM your_table
), temp2 AS (
  SELECT condition2, ARRAY_AGG(STRUCT(pos, condition1, value) ORDER BY condition1) r
  FROM temp1
  GROUP BY condition2
), temp3 AS ( 
  SELECT a.condition1, t1.condition2, SUM(b.value) result, a.pos 
  FROM temp2 t1 LEFT JOIN temp2 t2 ON t1.condition2 > t2.condition2,
  t1.r a JOIN t2.r b ON a.condition1 > b.condition1
  GROUP BY a.pos, condition1, condition2
)
SELECT * EXCEPT(pos) 
FROM temp1 LEFT JOIN temp3 USING(pos, condition1, condition2)    

If applied to the sample data in your updated question, the output is:

[image: query output]

A second option to try:

WITH temp1 AS (
  SELECT condition1, condition2, SUM(value) value
  FROM your_table
  GROUP BY condition1, condition2
), temp2 AS (
  SELECT condition2, ARRAY_AGG(STRUCT(condition1, value) ORDER BY condition1) r
  FROM temp1
  GROUP BY condition2
), temp3 AS ( 
  SELECT a.condition1, t1.condition2, SUM(b.value) result
  FROM temp2 t1 LEFT JOIN temp2 t2 ON t1.condition2 > t2.condition2,
  t1.r a JOIN t2.r b ON a.condition1 > b.condition1
  GROUP BY condition1, condition2
)
SELECT * 
FROM your_table LEFT JOIN temp3 USING(condition1, condition2)

with the same output, of course.

While the second version has a better chance of completing on a large dataset, it has a small limitation compared with the first version in this answer: the first version is very easy to modify for the condition mentioned in your question, i.e. condition2 <= rather than just <. (By the way, I no longer see this in your question, so I might be mixing something up here, sorry.)
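
The idea behind the second option (collapse duplicate pairs first, compare only distinct pairs, then join back) can be sketched in Python as follows. This is only an illustration of the three steps, not a translation of the SQL; the function name dominance_sum_grouped is hypothetical, and the sketch returns 0 where a SQL LEFT JOIN would typically yield NULL.

```python
from collections import defaultdict


def dominance_sum_grouped(rows):
    # Step 1: collapse duplicate (condition1, condition2) pairs into one total.
    totals = defaultdict(int)
    for c1, c2, v in rows:
        totals[(c1, c2)] += v

    # Step 2: compare distinct pairs only, summing the totals of every pair
    # that is strictly smaller in both columns.
    result = {
        (c1, c2): sum(v for (a, b), v in totals.items() if a < c1 and b < c2)
        for (c1, c2) in totals
    }

    # Step 3: join the per-pair result back onto the original rows.
    return [result[(c1, c2)] for (c1, c2, _) in rows]


# The ten rows of the question's example table, in the same order.
rows = [
    (1, 2, 1), (1, 2, 1), (2, 1, 1), (2, 2, 1), (2, 3, 1),
    (3, 2, 1), (3, 2, 1), (4, 5, 1), (5, 4, 1), (6, 7, 1),
]

print(dominance_sum_grouped(rows))  # -> [0, 0, 0, 0, 2, 1, 1, 7, 7, 9]
```

Grouping first shrinks the number of comparisons from the number of rows squared to the number of distinct pairs squared, which only helps when many rows share the same (condition1, condition2) pair.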

Upvotes: 1

Mikhail Berlyant

Reputation: 172993

I think your only option is a left join, as in the example below:

select any_value(t1).*, 
  ifnull(sum(t2.value), 0) output 
from your_table t1
left join your_table t2
on t1.ts > t2.ts
and t1.condition1 > t2.condition1
and t1.condition2 > t2.condition2
group by format('%t', t1)           

If applied to the sample data in your question, the output is:

[image: query output]

Note: whenever you talk about previous rows, you must have some column that defines the order of the rows. I hope you have one in your table; it can be a date, a timestamp, etc. I added a ts column for this purpose.

Also note: in row #9 the output should be 7, not 8 as presented in your question!

Upvotes: 1
