Scott Schaen
Scott Schaen

Reputation: 111

Sum Distinct by key (or Symmetric Aggregates) function in BigQuery

I'm trying to sum fanned/duplicated values by de-duping on their key. Looker calls this Symmetric Aggregates. I'd like to use a persistent UDF and not lean on subqueries. Looker has a pretty elegant solution that seems like it could be baked into a UDF.

I tried:

CREATE OR REPLACE FUNCTION `fn.sumdistinct`(unique_key ANY TYPE, val_to_sum ANY TYPE) AS (
 COALESCE(ROUND(COALESCE(CAST((SUM(DISTINCT (CAST(ROUND(COALESCE(safe_cast(val_to_sum as float64) ,0)*(1/1000*1.0), 9) AS NUMERIC) + (cast(cast(concat('0x', substr(to_hex(md5(CAST(unique_key  AS STRING))), 1, 15)) as int64) as numeric) * 4294967296 + cast(cast(concat('0x', substr(to_hex(md5(CAST(unique_key  AS STRING))), 16, 8)) as int64) as numeric)) * 0.000000001 )) - SUM(DISTINCT (cast(cast(concat('0x', substr(to_hex(md5(CAST(unique_key  AS STRING))), 1, 15)) as int64) as numeric) * 4294967296 + cast(cast(concat('0x', substr(to_hex(md5(CAST(unique_key  AS STRING))), 16, 8)) as int64) as numeric)) * 0.000000001) )  / (1/1000*1.0) AS FLOAT64), 0), 6), 0)
);

but I get:

Invalid function fn.sumdistinct. Aggregate function SUM not allowed in templated SQL function call

I'm looking for a function that can turn this:

id   val
1    100
2    200
2    200
3    300
3    300
3    300

into:

unique_ids  total_value
3           600

Upvotes: 2

Views: 7003

Answers (4)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

Below is for BigQuery Standard SQL

#standardSQL
CREATE TEMP FUNCTION SumDistinct(arr ANY TYPE) AS ((
  SELECT AS STRUCT 
    COUNT(DISTINCT id) unique_ids, 
    SUM(val) total_value
  FROM (
    SELECT ANY_VALUE(t).*
    FROM UNNEST(arr) t
    GROUP BY FORMAT('%t', t)
  )
));
SELECT SumDistinct(ARRAY_AGG(STRUCT(id, val))).*
FROM `project.dataset.data`   

If to apply to sample data from your question - result is

Row unique_ids  total_value  
1   3           600 

Upvotes: 1

saifuddin778
saifuddin778

Reputation: 7287

You can definitely go about doing this without an external GROUP BY:

CREATE OR REPLACE FUNCTION `dataset.sumdistinct` (values array<struct<id int64, val int64>>) as (
  (
    select 
      struct(
       count(distinct x.id) as col1, 
       sum(distinct x.val) as col2
      ) from unnest(values) as x
  )
);

select sumdistinct(array_agg(struct(id as id, val as val))) from `dataset.table`

Upvotes: 1

Yun Zhang
Yun Zhang

Reputation: 5518

Based on comments, seems you want a UDF to be able to call aggregation function inside. Maybe you're looking for User-defined aggregate function, which is not supported in BigQuery but can be done in below form:

The output is not what you expected because a UDF cannot output 2 column as in your example, hope you got the idea that you need to array_agg() the field and your UDF does UNNEST() inside and will be able to use system aggregate functions, like SUM():

CREATE TEMP FUNCTION sumdistinct (unique_key INT64, val_array ARRAY<INT64>) AS (
 (SELECT COALESCE(ROUND(COALESCE(CAST((SUM(DISTINCT (CAST(ROUND(COALESCE(safe_cast(val_to_sum as float64) ,0)*(1/1000*1.0), 9) AS NUMERIC) + (cast(cast(concat('0x', substr(to_hex(md5(CAST(unique_key  AS STRING))), 1, 15)) as int64) as numeric) * 4294967296 + cast(cast(concat('0x', substr(to_hex(md5(CAST(unique_key  AS STRING))), 16, 8)) as int64) as numeric)) * 0.000000001 )) - SUM(DISTINCT (cast(cast(concat('0x', substr(to_hex(md5(CAST(unique_key  AS STRING))), 1, 15)) as int64) as numeric) * 4294967296 + cast(cast(concat('0x', substr(to_hex(md5(CAST(unique_key  AS STRING))), 16, 8)) as int64) as numeric)) * 0.000000001) )  / (1/1000*1.0) AS FLOAT64), 0), 6), 0)
 FROM unnest(val_array) val_to_sum)
);
with data as 
(select 1 as id, 100 as val union all
select 1, 100 union all
select 1, 100 union all
select 2, 200 union all
select 2, 200 union all
select 3, 300 union all
select 3, 300
)
SELECT sumdistinct(id, array_agg(val))
FROM data
GROUP BY id

Upvotes: 0

Yun Zhang
Yun Zhang

Reputation: 5518

Depending on how you want to resolve different val to same id, you can tune aggregation function (max(val)) in below sql:

with data as 
(select 1 as id, 100 as val union all
select 1, 100 union all
select 1, 100 union all
select 2, 200 union all
select 2, 200 union all
select 3, 300 union all
select 3, 300
)
SELECT count(1) as unique_ids, sum(val) as total_value
FROM (
SELECT id, max(val) val
FROM data
GROUP BY id
)

Upvotes: 1

Related Questions