Reputation: 869
I have a table with some data in BigQuery. I want to find sums of value1, value2 and value3 grouped using sum limitation of sums value3. In example I set sum limit 60.
I know how to sort, filter and find sums in groups but I have no ideas how to group table values like this(step 2). Seems like I need to find cumulative sums row by row for each field and stop process when cumulative sum on value3 exceeds my limit. Then start it again.
How can I make it in BigQuery?
Upvotes: 1
Views: 609
Reputation: 172994
Below is for BigQuery Standard SQL
From OP's comment - I will use this part as window function. Each window will be 30-200 rows...
Sample data represents just example of one partition/window - so I added id
column to play role of partition and thus code can be applied to real use case (just replacing id
with actual respective column name)
#standardSQL
CREATE TEMP FUNCTION partitionBySum(arr ARRAY<STRUCT<val FLOAT64, pos FLOAT64>>, size FLOAT64)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
count = parseInt(arr.reduce((a, b) => a + (b.val > size?size:b.val), 0) / size);
count = (count==0?1:count);
repeat = true;
while (repeat) {
output = []; sum = []; repeat = false;
for (i = 0; i < count; i++) {
output.push(arr[i].pos + ',' + arr[i].val);
sum.push(arr[i].val);
};
for (i = count; i < arr.length; i++) {
min_sum = sum[0]; min_index = 0;
for (j = 0; j < count; j++){
if(sum[j] < min_sum){
min_index = j;
min_sum = sum[j];
};
};
output[min_index] = output[min_index] + ';' + arr[i].pos + ',' + arr[i].val;
sum[min_index] = sum[min_index] + arr[i].val;
if(output[min_index].includes(';') && sum[min_index] > size){
++count;
repeat = true;
break;
};
};
} ;
return output;
""";
--
WITH `project.dataset.table` AS (
SELECT 1 id, 6854 value1, 10 value2, 83 value3 UNION ALL
SELECT 1, 6723, 9, 82 UNION ALL
SELECT 1, 2234, 203, 49 UNION ALL
SELECT 1, 456, 1888, 48 UNION ALL
SELECT 1, 434, 679, 33 UNION ALL
SELECT 1, 789, 234, 32 UNION ALL
SELECT 1, 678, 11, 26 UNION ALL
SELECT 1, 345, 33, 19 UNION ALL
SELECT 1, 22, 345, 19 UNION ALL
SELECT 1, 232, 45, 17 UNION ALL
SELECT 1, 234, 4, 15 UNION ALL
SELECT 1, 45, 123, 13 UNION ALL
SELECT 1, 4, 123, 11 UNION ALL
SELECT 1, 123, 2, 11 UNION ALL
SELECT 1, 23, 76, 10 UNION ALL
SELECT 1, 34, 23, 8 UNION ALL
SELECT 1, 12, 45, 8 UNION ALL
SELECT 1, 23, 30, 7 UNION ALL
SELECT 1, 23, 2, 5 UNION ALL
SELECT 1, 12, 4, 4
),
--
data_with_positions as (
-- adding position number to distinguish same values in different rows
-- for example two 19s and two 11s in sample data
SELECT *, ROW_NUMBER() OVER(PARTITION BY id) pos
FROM `project.dataset.table`
), grouped_by_value3 AS (
-- grouping value3 (along with their respective id, pos) based on summation
SELECT id,
CAST(SPLIT(line)[OFFSET(0)] AS INT64) pos,
CAST(SPLIT(line)[OFFSET(1)] AS INT64) value3,
group_id
FROM (
SELECT id, ROW_NUMBER() OVER(PARTITION BY id) group_id, grp
FROM (
SELECT id,
partitionBySum(ARRAY_AGG(STRUCT(CAST(value3 AS FLOAT64), CAST(pos AS FLOAT64) ) ORDER BY value3 DESC), 60) arr
FROM data_with_positions GROUP BY id
), UNNEST(arr) grp
), UNNEST(SPLIT(grp, ';')) line
), all_values_with_groups AS (
-- join grouping info back to data
SELECT id, pos, value1, value2, value3, group_id
FROM data_with_positions
JOIN grouped_by_value3 USING(id, pos, value3)
)
SELECT id, group_id,
STRING_AGG(CAST(value1 AS STRING) ORDER BY value3 DESC) list_values1,
STRING_AGG(CAST(value2 AS STRING) ORDER BY value3 DESC) list_values2,
STRING_AGG(CAST(value3 AS STRING) ORDER BY value3 DESC) list_values3,
SUM(value1) sum_values1,
SUM(value2) sum_values2,
SUM(value3) sum_values3,
FROM all_values_with_groups
GROUP BY id, group_id ORDER BY id, group_id
with result
Brief explanation
Actually in above solution there are few distinct logical parts
Part 1: [most complex part] Grouping by sum of value3 - keeping it less or equal to some value (60 in this example)
The main logic is implemented in JS UDF with below logic (just main steps):
N
N + 1
and repeat above ##2,3 with new number of groupsResult is captured in grouped_by_value3
CTE and looks like below
Part 2: Join grouping info (from part 1) back to main data (note positions where added to main data so data_with_positions
CTE is used here)
Result is captured in all_values_with_groups
CTE and looks like below
Part 3: Final aggregations - with result shown already on the top of answer :o)
Testing
As a part of light testing - I run this solution with very few cases of dummy data - below is one of such.
WITH `project.dataset.table` AS (
SELECT 1 id, 6854 value1, 10 value2, 83 value3 UNION ALL
SELECT 1, 6723, 9, 82 UNION ALL
SELECT 1, 2234, 203, 49 UNION ALL
SELECT 1, 456, 1888, 48 UNION ALL
SELECT 1, 434, 679, 33 UNION ALL
SELECT 1, 789, 234, 32 UNION ALL
SELECT 1, 678, 11, 26 UNION ALL
SELECT 1, 345, 33, 19 UNION ALL
SELECT 1, 22, 345, 19 UNION ALL
SELECT 1, 232, 45, 17 UNION ALL
SELECT 1, 234, 4, 15 UNION ALL
SELECT 1, 45, 123, 13 UNION ALL
SELECT 1, 4, 123, 11 UNION ALL
SELECT 1, 123, 2, 11 UNION ALL
SELECT 1, 23, 76, 10 UNION ALL
SELECT 1, 34, 23, 8 UNION ALL
SELECT 1, 12, 45, 8 UNION ALL
SELECT 1, 23, 30, 7 UNION ALL
SELECT 1, 23, 2, 5 UNION ALL
SELECT 1, 12, 4, 4 UNION ALL
SELECT 2, 2, 50, 45 UNION ALL
SELECT 2, 2, 50, 45 UNION ALL
SELECT 2, 3, 60, 44 UNION ALL
SELECT 2, 3, 60, 44 UNION ALL
SELECT 2, 3, 60, 44 UNION ALL
SELECT 2, 3, 60, 44 UNION ALL
SELECT 2, 3, 60, 44 UNION ALL
SELECT 3, 2, 50, 5 UNION ALL
SELECT 3, 2, 50, 5 UNION ALL
SELECT 3, 3, 60, 5 UNION ALL
SELECT 3, 3, 60, 85 UNION ALL
SELECT 3, 3, 60, 45 UNION ALL
SELECT 4, 2, 50, 25 UNION ALL
SELECT 4, 2, 50, 25 UNION ALL
SELECT 4, 3, 60, 24 UNION ALL
SELECT 4, 3, 60, 24 UNION ALL
SELECT 4, 3, 60, 24 UNION ALL
SELECT 4, 3, 60, 24 UNION ALL
SELECT 4, 3, 60, 24
)
and below is output
Upvotes: 2