How to table partitions by setting a limit on the cumulative sum?

Question

I have a table with some data in BigQuery. I want to find sums of value1, value2 and value3 grouped using sum limitation of sums value3. In example I set sum limit 60.

I know how to sort, filter and find sums in groups but I have no ideas how to group table values like this(step 2). Seems like I need to find cumulative sums row by row for each field and stop process when cumulative sum on value3 exceeds my limit. Then start it again.

How can I make it in BigQuery?

Mikhail Berlyant · Accepted Answer

Below is for BigQuery Standard SQL

From OP's comment - I will use this part as window function. Each window will be 30-200 rows...

Sample data represents just example of one partition/window - so I added id column to play role of partition and thus code can be applied to real use case (just replacing id with actual respective column name)

#standardSQL 
CREATE TEMP FUNCTION partitionBySum(arr ARRAY>, size FLOAT64)
RETURNS ARRAY
LANGUAGE js AS """
  count = parseInt(arr.reduce((a, b) => a + (b.val > size?size:b.val), 0) / size);
  count = (count==0?1:count);
  repeat = true;
  
  while (repeat) {
    output = []; sum = []; repeat = false;
    for (i = 0; i < count; i++) {
      output.push(arr[i].pos + ',' + arr[i].val);
      sum.push(arr[i].val);
    };
    
    for (i = count; i < arr.length; i++) { 
      min_sum = sum[0]; min_index = 0;
      for (j = 0; j < count; j++){
        if(sum[j] < min_sum){
          min_index = j;
          min_sum = sum[j];
        };
      };
      output[min_index] = output[min_index] + ';' + arr[i].pos + ',' + arr[i].val;
      sum[min_index] = sum[min_index] + arr[i].val;
      if(output[min_index].includes(';') && sum[min_index] > size){
        ++count;
        repeat = true;
        break;
      };
    };

  } ;
  return output;
""";

--

WITH `project.dataset.table` AS (
  SELECT 1 id, 6854 value1, 10 value2, 83 value3 UNION ALL
  SELECT 1, 6723, 9, 82 UNION ALL
  SELECT 1, 2234, 203, 49 UNION ALL
  SELECT 1, 456, 1888, 48 UNION ALL
  SELECT 1, 434, 679, 33 UNION ALL
  SELECT 1, 789, 234, 32 UNION ALL
  SELECT 1, 678, 11, 26 UNION ALL
  SELECT 1, 345, 33, 19 UNION ALL
  SELECT 1, 22, 345, 19 UNION ALL
  SELECT 1, 232, 45, 17 UNION ALL
  SELECT 1, 234, 4, 15 UNION ALL
  SELECT 1, 45, 123, 13 UNION ALL
  SELECT 1, 4, 123, 11 UNION ALL
  SELECT 1, 123, 2, 11 UNION ALL
  SELECT 1, 23, 76, 10 UNION ALL
  SELECT 1, 34, 23, 8 UNION ALL
  SELECT 1, 12, 45, 8 UNION ALL
  SELECT 1, 23, 30, 7 UNION ALL
  SELECT 1, 23, 2, 5 UNION ALL
  SELECT 1, 12, 4, 4 
),

--

  data_with_positions as ( 
  -- adding position number to distinguish same values in different rows
  -- for example two 19s and two 11s in sample data
  SELECT *, ROW_NUMBER() OVER(PARTITION BY id) pos
  FROM `project.dataset.table`
), grouped_by_value3 AS (
  -- grouping value3 (along with their respective id, pos) based on summation
  SELECT id, 
    CAST(SPLIT(line)[OFFSET(0)] AS INT64) pos,
    CAST(SPLIT(line)[OFFSET(1)] AS INT64) value3,
    group_id
  FROM (
    SELECT id, ROW_NUMBER() OVER(PARTITION BY id) group_id, grp    
    FROM (
      SELECT id, 
        partitionBySum(ARRAY_AGG(STRUCT(CAST(value3 AS FLOAT64), CAST(pos AS FLOAT64) ) ORDER BY value3 DESC), 60) arr
      FROM data_with_positions GROUP BY id
    ), UNNEST(arr) grp
  ), UNNEST(SPLIT(grp, ';')) line
), all_values_with_groups AS (
  -- join grouping info back to data 
  SELECT id, pos, value1, value2, value3, group_id
  FROM data_with_positions
  JOIN grouped_by_value3 USING(id, pos, value3)
)
SELECT id, group_id,
  STRING_AGG(CAST(value1 AS STRING) ORDER BY value3 DESC) list_values1,
  STRING_AGG(CAST(value2 AS STRING) ORDER BY value3 DESC) list_values2,
  STRING_AGG(CAST(value3 AS STRING) ORDER BY value3 DESC) list_values3,
  SUM(value1) sum_values1,
  SUM(value2) sum_values2,
  SUM(value3) sum_values3,
FROM all_values_with_groups
GROUP BY id, group_id ORDER BY id, group_id

with result

Brief explanation

Actually in above solution there are few distinct logical parts

Part 1: [most complex part] Grouping by sum of value3 - keeping it less or equal to some value (60 in this example)

The main logic is implemented in JS UDF with below logic (just main steps):

calculate initial number of partitions/groups - N
initiate array with N elements from first N rows value3
loop through the rest of value3 and in each iteration add it to the group with min sum till that sum will exceed the limit (with exception when it is just only the value3 in the group)
if sum limit in #3 above is exceeded - increment count - N + 1 and repeat above ##2,3 with new number of groups
While processing ##1-4, all respective positions are kept intact with value3 so then it can be joined back to initial data in Part 2

Result is captured in grouped_by_value3 CTE and looks like below

Part 2: Join grouping info (from part 1) back to main data (note positions where added to main data so data_with_positions CTE is used here)

Result is captured in all_values_with_groups CTE and looks like below

Part 3: Final aggregations - with result shown already on the top of answer :o)

Testing

As a part of light testing - I run this solution with very few cases of dummy data - below is one of such.

WITH `project.dataset.table` AS (
  SELECT 1 id, 6854 value1, 10 value2, 83 value3 UNION ALL
  SELECT 1, 6723, 9, 82 UNION ALL
  SELECT 1, 2234, 203, 49 UNION ALL
  SELECT 1, 456, 1888, 48 UNION ALL
  SELECT 1, 434, 679, 33 UNION ALL
  SELECT 1, 789, 234, 32 UNION ALL
  SELECT 1, 678, 11, 26 UNION ALL
  SELECT 1, 345, 33, 19 UNION ALL
  SELECT 1, 22, 345, 19 UNION ALL
  SELECT 1, 232, 45, 17 UNION ALL
  SELECT 1, 234, 4, 15 UNION ALL
  SELECT 1, 45, 123, 13 UNION ALL
  SELECT 1, 4, 123, 11 UNION ALL
  SELECT 1, 123, 2, 11 UNION ALL
  SELECT 1, 23, 76, 10 UNION ALL
  SELECT 1, 34, 23, 8 UNION ALL
  SELECT 1, 12, 45, 8 UNION ALL
  SELECT 1, 23, 30, 7 UNION ALL
  SELECT 1, 23, 2, 5 UNION ALL
  SELECT 1, 12, 4, 4 UNION ALL
  SELECT 2, 2, 50, 45 UNION ALL
  SELECT 2, 2, 50, 45 UNION ALL
  SELECT 2, 3, 60, 44 UNION ALL
  SELECT 2, 3, 60, 44 UNION ALL
  SELECT 2, 3, 60, 44 UNION ALL
  SELECT 2, 3, 60, 44 UNION ALL
  SELECT 2, 3, 60, 44 UNION ALL
  SELECT 3, 2, 50, 5 UNION ALL
  SELECT 3, 2, 50, 5 UNION ALL
  SELECT 3, 3, 60, 5 UNION ALL
  SELECT 3, 3, 60, 85 UNION ALL
  SELECT 3, 3, 60, 45 UNION ALL
  SELECT 4, 2, 50, 25 UNION ALL
  SELECT 4, 2, 50, 25 UNION ALL
  SELECT 4, 3, 60, 24 UNION ALL
  SELECT 4, 3, 60, 24 UNION ALL
  SELECT 4, 3, 60, 24 UNION ALL
  SELECT 4, 3, 60, 24 UNION ALL
  SELECT 4, 3, 60, 24
)

and below is output

How to table partitions by setting a limit on the cumulative sum?

Answers (1)

Related Questions