skyzine
skyzine

Reputation: 91

Pass array to Snowflake UDF

My goal is to create a Snowflake UDF that, given an array of values from different columns, returns the maximum value.

This is the function I currently have:

CREATE OR REPLACE FUNCTION get_max(input_array array)
 RETURNS double precision
AS '
  WITH t AS
  (
     SELECT value::integer as val from table(flatten(input => input_array))
     WHERE VAL IS NOT NULL
  ),
  cnt AS
  (
    SELECT COUNT(*) AS c FROM t
  )
  SELECT MAX(val)::float
  FROM 
  (
    SELECT val FROM t
  ) t2
'

When I pass different columns from a table, e.g. select get_max(to_array([table.col1, table.col2, table.col3])) I get the error

Unsupported subquery type cannot be evaluated

However, if I run the sql query only and replace input_array with an array such as array_construct(7, 120, 2, 4, 5, 80) there is no error and the correct value is returned.

  WITH t AS
  (
     SELECT value::integer as val from table(flatten(input => array_construct(2,4,5)))
     WHERE VAL IS NOT NULL
  ),
  cnt AS
  (
    SELECT COUNT(*) AS c FROM t
  )
  SELECT MAX(val)::float
  FROM 
  (
    SELECT val FROM t
  ) t2

Upvotes: 4

Views: 2798

Answers (1)

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

When flattening arrays in a SQL UDF gives you trouble, you can always write a JS, Java, or Python UDF instead.

Here you can see a JS and a Python UDF in action:

CREATE OR REPLACE FUNCTION get_max_from_array_js(input_array array)
RETURNS double precision
language javascript
as 
$$
return Math.max(...INPUT_ARRAY)
$$;

CREATE OR REPLACE FUNCTION get_max_from_array_py(input_array array)
RETURNS double precision
language python
handler = 'x'
runtime_version = 3.8
as 
$$
def x(input_array):
    return max(input_array)
$$;


select get_max_from_array_js([1.1,7.7,2.2,3.3,4.4]);
select get_max_from_array_py([1.1,7.7,2.2,3.3,4.4]);

But given the problem statement, consider using GREATEST in SQL instead:

select greatest(table.col1, table.col2, table.col3)

Performance wise, pure SQL is the best, then JS, then Python:

select current_date()
    , max(greatest(c_customer_sk, c_current_cdemo_sk, c_current_hdemo_sk, c_current_addr_sk, c_first_shipto_date_sk)) m
from snowflake_sample_data.tpcds_sf10tcl.customer
-- 692ms S
-- 155ms 3XL
;

select current_date()
    , max(get_max_from_array_js([c_customer_sk, c_current_cdemo_sk, c_current_hdemo_sk, c_current_addr_sk, c_first_shipto_date_sk])) m
from snowflake_sample_data.tpcds_sf10tcl.customer
where c_customer_sk is not null
and c_current_cdemo_sk is not null
and c_current_hdemo_sk is not null
and c_current_addr_sk is not null
and c_first_shipto_date_sk is not null
-- 15s S
-- 1.2s 3XL
;

select current_date()
    , max(get_max_from_array_py([c_customer_sk, c_current_cdemo_sk, c_current_hdemo_sk, c_current_addr_sk, c_first_shipto_date_sk])) m
from snowflake_sample_data.tpcds_sf10tcl.customer
where c_customer_sk is not null
and c_current_cdemo_sk is not null
and c_current_hdemo_sk is not null
and c_current_addr_sk is not null
and c_first_shipto_date_sk is not null
-- 32s S
-- 4.3s 3XL
;

Upvotes: 1

Related Questions