Reputation: 622
I am using BigQuery Standard SQL and need to take a rolling sum of submit_amount
by customer (cust_id
) from my transaction table.
However, I can only include certain transactions in the rolling sum. My conditions are:
paid_date
of the row that may need to be included in the rolling sum IS NULL
or >=
the submit_date
of the row that is being calculatedsubmit_date
of the row that may need to be included in the rolling sum is between 0 and 30 days prior to the submit_date
the row that is being calculatedcust_id
must match current row cust_id
Here is a sample of my data, with the accrued_amount
field that I would like to calculate. I also added a field here just to make more clear which transactions should qualify for the rolling sum in each row.
txn_id | cust_id | submit_date | paid_date | submit_amount | accrued_amount | qual_txn_id ------------------------------------------------------------------------------------------- 1 | 1 | 2020-01-01 | 2020-01-15 | 10 | 10 | 1 2 | 1 | 2020-01-12 | 2020-02-01 | 5 | 15 | 1, 2 3 | 1 | 2020-01-25 | NULL | 2 | 7 | 2, 3 4 | 1 | 2020-02-05 | NULL | 4 | 6 | 3, 4 5 | 1 | 2020-02-06 | NULL | 1 | 7 | 3, 4, 5 6 | 1 | 2020-03-01 | 2020-03-15 | 3 | 8 | 4, 5, 6 7 | 2 | 2020-03-05 | 2020-03-20 | 6 | 6 | 7 8 | 2 | 2020-03-25 | NULL | 2 | 2 | 8
I have worked out criteria #2 and #3, but am not sure how to incorporate #1. Below is my current query:
SELECT
txn_id, cust_id, submit_date, paid_date, submit_amount,
SUM(submit_amount) OVER (
PARTITION BY cust_id
ORDER BY UNIX_DATE(submit_date)
RANGE BETWEEN 29 PRECEDING AND CURRENT ROW)
AS accrued_amount
FROM t;
I tried using CASE WHEN
in place of submit_amount
(inside the SUM()
) to apply the paid_date
criteria, but couldn't get it to compare the right rows to each other.
What is the right way to do this?
Upvotes: 0
Views: 439
Reputation: 173210
Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(arr),
(
SELECT SUM(IF(paid_date IS NULL OR paid_date >= submit_date, submit_amount, 0))
FROM UNNEST(arr)
) AS accrued_amount,
(
SELECT STRING_AGG(IF(paid_date IS NULL OR paid_date >= submit_date, CAST(txn_id AS STRING), NULL))
FROM UNNEST(arr)
) AS qual_txn_id
FROM (
SELECT *,
ARRAY_AGG(STRUCT(txn_id, paid_date, submit_amount)) OVER (
PARTITION BY cust_id
ORDER BY UNIX_DATE(submit_date)
RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
) AS arr
FROM `project.dataset.table`
)
If to apply to sample data from your question as in example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 txn_id, 1 cust_id, DATE '2020-01-01' submit_date, DATE '2020-01-15' paid_date, 10 submit_amount UNION ALL
SELECT 2, 1, '2020-01-12', '2020-02-01', 5, UNION ALL
SELECT 3, 1, '2020-01-25', NULL, 2 UNION ALL
SELECT 4, 1, '2020-02-05', NULL, 4 UNION ALL
SELECT 5, 1, '2020-02-06', NULL, 1 UNION ALL
SELECT 6, 1, '2020-03-01', '2020-03-15', 3 UNION ALL
SELECT 7, 2, '2020-03-05', '2020-03-20', 6 UNION ALL
SELECT 8, 2, '2020-03-25', NULL, 2
)
SELECT * EXCEPT(arr),
(
SELECT SUM(IF(paid_date IS NULL OR paid_date >= submit_date, submit_amount, 0))
FROM UNNEST(arr)
) AS accrued_amount,
(
SELECT STRING_AGG(IF(paid_date IS NULL OR paid_date >= submit_date, CAST(txn_id AS STRING), NULL))
FROM UNNEST(arr)
) AS qual_txn_id
FROM (
SELECT *,
ARRAY_AGG(STRUCT(txn_id, paid_date, submit_amount)) OVER (
PARTITION BY cust_id
ORDER BY UNIX_DATE(submit_date)
RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
) AS arr
FROM `project.dataset.table`
)
result is
Row txn_id cust_id submit_date paid_date submit_amount accrued_amount qual_txn_id
1 1 1 2020-01-01 2020-01-15 10 10 1
2 2 1 2020-01-12 2020-02-01 5 15 1,2
3 3 1 2020-01-25 null 2 7 2,3
4 4 1 2020-02-05 null 4 6 3,4
5 5 1 2020-02-06 null 1 7 3,4,5
6 6 1 2020-03-01 2020-03-15 3 8 4,5,6
7 7 2 2020-03-05 2020-03-20 6 6 7
8 8 2 2020-03-25 null 2 2 8
Also, below is refactored - slightly less verbose version of above one with exactly same output
#standardSQL
SELECT * EXCEPT(arr),
(
SELECT AS STRUCT
SUM(IF(qual, submit_amount, 0)) AS accrued_amount,
STRING_AGG(IF(qual, CAST(txn_id AS STRING), NULL)) AS qual_txn_id
FROM UNNEST(arr), UNNEST([paid_date IS NULL OR paid_date >= submit_date]) qual
).*
FROM (
SELECT *,
ARRAY_AGG(STRUCT(txn_id, paid_date, submit_amount)) OVER (
PARTITION BY cust_id
ORDER BY UNIX_DATE(submit_date)
RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
) AS arr
FROM `project.dataset.table`
)
Upvotes: 2