er0
er0

Reputation: 1834

How to run an array UDF on a BigQuery column

I would like to process a column in a table as an array using a User Defined Function (UDF) written in Javascript. The function prototype of the UDF looks like this:

CREATE TEMPORARY FUNCTION Summary(items ARRAY<FLOAT64>, seed FLOAT64)
RETURNS ARRAY<FLOAT64>

I would like to turn the output array into an additional column. If the name of the column of my table, which contains FLOAT64 elements, is computed_items then how would I create a new column, item_summaries containing one-value-per-row of the output of the function applied on computed_items.

Upvotes: 0

Views: 256

Answers (1)

Samuel
Samuel

Reputation: 3528

Please always try to make the UDF run on each row and not of the array of all rows. If this is not possible, please check to total amount of rows to not exceed the array cluster size.

Your UDF has an input array items, which are the values of a column. And a seed value. Please add to your UDF an array of row id as input. The output of the UDF cannot be an array as rows, but an array of structure of id and item value. Please see the JS of the UDF for details. The calculation of the new item value is here as an example old value plus seed value.

In case your table has an unique row_number, you can skip the first part and your table is already h1.

Your table is given as tbl and we generate some date. For testing we mix up the order of the table with order by rand(). The helper table adds a row_number as compute_id. In the next steps we need to query this table twice and need to obtain the same mapping of the row_number. This can be archived by making storing this table in a recursive table h1.

The table h2 is using your UDF Summary and as inputs the rows are build up as arrays using the array_agg function and the output array is named Summary_. Next the output Summary_ is unnested and the columns are renamed.

The final table h3 joins the array from the Summary to the table using row_numbers in column compute_id.

create temp function Summary(items ARRAY<FLOAT64>,id ARRAY<int64>, seed FLOAT64)
RETURNS ARRAY<struct<computed_items FLOAT64,id int64>>
language js as 
"""
var calculated_items=[];
for(let i in items){
calculated_items[i]=items[i]+seed;
}
var out=[];
for(let i in calculated_items){
out[i]={computed_items:calculated_items[i],id:id[i]}
}
return out;
""";
with recursive
tbl as (select x, x+1000 as computed_items from unnest(generate_array(0,1000))x order by rand()
),
helper as (select row_number() over () as compute_id, x, computed_items from tbl),
h1 as (select * from helper union all select * from h1 where false),
h2 as (select t0.computed_items, t0.id as compute_id from (select Summary(array_agg(computed_items*1.0),array_agg(compute_id),10000.0) Summary_ from h1) as XX,unnest(XX.Summary_) as t0),
h3 as (select * from h1 left join h2 using (compute_id)) 
select * from h3
order by x

Upvotes: 1

Related Questions