user45392
user45392

Reputation: 622

SQL conditional rolling sum

I am using BigQuery Standard SQL and need to take a rolling sum of submit_amount by customer (cust_id) from my transaction table.

However, I can only include certain transactions in the rolling sum. My conditions are:

  • paid_date of the row that may need to be included in the rolling sum IS NULL or >= the submit_date of the row that is being calculated
  • submit_date of the row that may need to be included in the rolling sum is between 0 and 30 days prior to the submit_date the row that is being calculated
  • cust_id must match current row cust_id
  • Here is a sample of my data, with the accrued_amount field that I would like to calculate. I also added a field here just to make more clear which transactions should qualify for the rolling sum in each row.

    txn_id | cust_id | submit_date | paid_date  | submit_amount | accrued_amount | qual_txn_id
    -------------------------------------------------------------------------------------------
         1 |       1 |  2020-01-01 | 2020-01-15 |            10 |            10  |   1
         2 |       1 |  2020-01-12 | 2020-02-01 |             5 |            15  |   1, 2
         3 |       1 |  2020-01-25 |       NULL |             2 |             7  |   2, 3
         4 |       1 |  2020-02-05 |       NULL |             4 |             6  |   3, 4
         5 |       1 |  2020-02-06 |       NULL |             1 |             7  |   3, 4, 5
         6 |       1 |  2020-03-01 | 2020-03-15 |             3 |             8  |   4, 5, 6
         7 |       2 |  2020-03-05 | 2020-03-20 |             6 |             6  |   7
         8 |       2 |  2020-03-25 |       NULL |             2 |             2  |   8
    

    I have worked out criteria #2 and #3, but am not sure how to incorporate #1. Below is my current query:

    SELECT 
      txn_id, cust_id, submit_date, paid_date, submit_amount,
      SUM(submit_amount) OVER (
        PARTITION BY cust_id
        ORDER BY UNIX_DATE(submit_date)
        RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)
      AS accrued_amount
    FROM t;
    

    I tried using CASE WHEN in place of submit_amount (inside the SUM()) to apply the paid_date criteria, but couldn't get it to compare the right rows to each other.

    What is the right way to do this?

    Upvotes: 0

    Views: 439

    Answers (1)

    Mikhail Berlyant
    Mikhail Berlyant

    Reputation: 173210

    Below is for BigQuery Standard SQL

    #standardSQL
    SELECT * EXCEPT(arr),
        (
          SELECT SUM(IF(paid_date IS NULL OR paid_date >= submit_date, submit_amount, 0))
          FROM UNNEST(arr) 
        ) AS accrued_amount, 
        (
          SELECT STRING_AGG(IF(paid_date IS NULL OR paid_date >= submit_date, CAST(txn_id AS STRING), NULL))
          FROM UNNEST(arr) 
        ) AS qual_txn_id 
    FROM (
      SELECT *,
        ARRAY_AGG(STRUCT(txn_id, paid_date, submit_amount)) OVER (
          PARTITION BY cust_id 
          ORDER BY UNIX_DATE(submit_date)
          RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
        ) AS arr  
      FROM `project.dataset.table`
    )    
    

    If to apply to sample data from your question as in example below

    #standardSQL
    WITH `project.dataset.table` AS (
      SELECT 1 txn_id, 1 cust_id, DATE '2020-01-01' submit_date, DATE '2020-01-15' paid_date, 10 submit_amount UNION ALL
      SELECT 2, 1, '2020-01-12', '2020-02-01', 5, UNION ALL
      SELECT 3, 1, '2020-01-25', NULL, 2 UNION ALL
      SELECT 4, 1, '2020-02-05', NULL, 4 UNION ALL
      SELECT 5, 1, '2020-02-06', NULL, 1 UNION ALL
      SELECT 6, 1, '2020-03-01', '2020-03-15', 3 UNION ALL
      SELECT 7, 2, '2020-03-05', '2020-03-20', 6 UNION ALL
      SELECT 8, 2, '2020-03-25', NULL, 2
    )
    SELECT * EXCEPT(arr),
        (
          SELECT SUM(IF(paid_date IS NULL OR paid_date >= submit_date, submit_amount, 0))
          FROM UNNEST(arr) 
        ) AS accrued_amount, 
        (
          SELECT STRING_AGG(IF(paid_date IS NULL OR paid_date >= submit_date, CAST(txn_id AS STRING), NULL))
          FROM UNNEST(arr) 
        ) AS qual_txn_id 
    FROM (
      SELECT *,
        ARRAY_AGG(STRUCT(txn_id, paid_date, submit_amount)) OVER (
          PARTITION BY cust_id 
          ORDER BY UNIX_DATE(submit_date)
          RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
        ) AS arr  
      FROM `project.dataset.table`
    )    
    

    result is

    Row txn_id  cust_id submit_date paid_date   submit_amount   accrued_amount  qual_txn_id  
    1   1       1       2020-01-01  2020-01-15  10              10              1    
    2   2       1       2020-01-12  2020-02-01  5               15              1,2  
    3   3       1       2020-01-25  null        2               7               2,3  
    4   4       1       2020-02-05  null        4               6               3,4  
    5   5       1       2020-02-06  null        1               7               3,4,5    
    6   6       1       2020-03-01  2020-03-15  3               8               4,5,6    
    7   7       2       2020-03-05  2020-03-20  6               6               7    
    8   8       2       2020-03-25  null        2               2               8   
    

    Also, below is refactored - slightly less verbose version of above one with exactly same output

    #standardSQL
    SELECT * EXCEPT(arr),
        (
          SELECT AS STRUCT 
            SUM(IF(qual, submit_amount, 0)) AS accrued_amount,
            STRING_AGG(IF(qual, CAST(txn_id AS STRING), NULL)) AS qual_txn_id
          FROM UNNEST(arr), UNNEST([paid_date IS NULL OR paid_date >= submit_date]) qual 
        ).* 
    FROM (
      SELECT *,
        ARRAY_AGG(STRUCT(txn_id, paid_date, submit_amount)) OVER (
          PARTITION BY cust_id 
          ORDER BY UNIX_DATE(submit_date)
          RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
        ) AS arr  
      FROM `project.dataset.table`
    ) 
    

    Upvotes: 2

    Related Questions