RDK

Reputation: 153

Create UDAF (not UDTF) in Snowflake

Java UDFs return a scalar result. Java UDTFs are not currently supported (reference).

That said, I created a Java UDF as given below:

CREATE OR replace function MAP_COUNT(colValue String)
returns OBJECT 
language java
handler='Frequency.calculate'
target_path='@~/Frequency.jar'
as
$$
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;
    class Frequency {
        Map<String, Integer> frequencies = new HashMap<>();

        public Map<String, Integer> calculate(String colValue) {
            frequencies.putIfAbsent(colValue, 0);
            frequencies.computeIfPresent(colValue, (key, value) -> value + 1);
            return frequencies;
        }
    }
$$;

Using the MAP_COUNT UDF in a query as below:

with temp_1 as
(
    SELECT 'John' AS my_col, 27 as age
    UNION ALL
    SELECT 'John' AS my_col, 28 as age
    UNION ALL
    SELECT 'doe' AS my_col, 27 as age
    UNION ALL
    SELECT 'doe' AS my_col, 28 as age
)
select  MAP_COUNT(a.my_col) from temp_1 a;

I get the result below:

|MAP_COUNT(A.MY_COL)            |
|-------------------------------|
|{  "John": "1" }               |
|{  "John": "2" }               |
|{ "John": "2",  "doe": "1" }   |
|{  "John": "2",  "doe": "2"}   |

The result I expect from my UDF is as below:

|MAP_COUNT(A.MY_COL)            |
|-------------------------------|
|{  "John": "2",  "doe": "2"}   |

Is this possible in Snowflake?

What if I have a query like the one below?

with temp_1 as
(
    SELECT 'John' AS my_col, 27 as age
    UNION ALL
    SELECT 'John' AS my_col, 28 as age
    UNION ALL
    SELECT 'doe' AS my_col, 27 as age
    UNION ALL
    SELECT 'doe' AS my_col, 28 as age
)
select  MAP_COUNT(a.my_col) as names, MAP_COUNT(a.age) as ages  from temp_1 a;

The result I expect from my UDF is as below:

|NAMES                          |AGES                           |
|-------------------------------|-------------------------------|
|{  "John": "2",  "doe": "2"}   |{  "27": "2",  "28": "2"}      |

There are ways to achieve this by simply restructuring the query, but I want to know whether the MAP_COUNT function can be used in the SELECT clause like an aggregate function such as OBJECT_AGG.
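
For reference, here is a sketch of one such restructured query using the built-in OBJECT_AGG aggregate (which takes a VARCHAR key and a VARIANT value per group): count each distinct value first, then fold the value/count pairs into a single OBJECT.

with temp_1 as
(
    SELECT 'John' AS my_col, 27 as age
    UNION ALL
    SELECT 'John' AS my_col, 28 as age
    UNION ALL
    SELECT 'doe' AS my_col, 27 as age
    UNION ALL
    SELECT 'doe' AS my_col, 28 as age
)
-- fold the (value, count) pairs into a single OBJECT
select OBJECT_AGG(my_col, cnt::variant) as names
from (
    -- count each distinct value first
    select my_col, count(*) as cnt
    from temp_1
    group by my_col
);

This should return a single row like { "John": 2, "doe": 2 }, but it requires the GROUP BY subquery rather than a single aggregate call over the raw rows.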

Upvotes: 2

Views: 1269

Answers (2)

Stuart Ozer

Reputation: 1384

You can achieve much of this desired functionality using JavaScript UDTFs today. In particular, a UDTF can be configured to return only an (aggregated) result at the end of a "grouping". Here is an example:

CREATE OR REPLACE FUNCTION MAP_COUNT(COLVALUE varchar)
  RETURNS TABLE (FREQUENCIES variant)
  LANGUAGE JAVASCRIPT
  AS '
  {
    initialize: function (argumentInfo, context) {
        // Plain object used as the frequency accumulator; it serializes
        // directly into the VARIANT output column.
        this.freq = {};
    },

    processRow: function (row, rowWriter, context) {
        // Increment the count for this value; no row is written yet.
        var freqVal = this.freq[row.COLVALUE];
        this.freq[row.COLVALUE] = (freqVal === undefined ? 1 : 1 + freqVal);
    },

    finalize: function (rowWriter, context) {
        // Emit a single aggregated row at the end of the partition.
        rowWriter.writeRow({FREQUENCIES: this.freq});
    }
}
'
;

Create a temp table to test this on:

create or replace temporary table mytemp as
with temp_1 as
(
    SELECT 'John' AS my_col, 27 as age
    UNION ALL
    SELECT 'John' AS my_col, 28 as age
    UNION ALL
    SELECT 'doe' AS my_col, 27 as age
    UNION ALL
    SELECT 'doe' AS my_col, 28 as age
)
select * from temp_1;

Run the UDTF over the whole table to produce a single aggregated result:

select agg.* from myTemp,
table(map_count(my_col)) agg
;
    Result:  { "John": 2, "doe": 2 }

Run the UDTF partitioned by separate groupings:

select my_col, agg.* from myTemp,
table(map_count(age::varchar) over (partition by my_col)) agg
;
    Result:
    doe   { "27": 1, "28": 1 }
    John  { "27": 1, "28": 1 }
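
To get both maps in a single row, as in the question's second expected output, a minimal (untested) sketch is to run the UDTF once per column using the same single-partition pattern and put the two one-row results side by side:

with names as (
    -- whole-table aggregation of the name column (no PARTITION BY)
    select agg.frequencies as names
    from myTemp, table(map_count(my_col)) agg
),
ages as (
    -- whole-table aggregation of the age column
    select agg.frequencies as ages
    from myTemp, table(map_count(age::varchar)) agg
)
select names.names, ages.ages
from names cross join ages;

Each CTE should produce one aggregated row, so the cross join yields a single row containing both objects.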

Upvotes: 0

Elliott Brossard

Reputation: 33745

When you run a query that uses a UDF, not all rows will necessarily go to the same instance of the UDF. For example, let's say that you're selecting from a table, and you do:

SELECT MyUdf(x) FROM T

Here T may have multiple micro-partitions, and the way that it executes is actually similar to:

SELECT MyUdf(x) FROM T_part1 UNION ALL
SELECT MyUdf(x) FROM T_part2 UNION ALL
SELECT MyUdf(x) FROM T_part3 UNION ALL
SELECT MyUdf(x) FROM T_part4

Here there are four separate instances of MyUdf, and each one sees just a subset of the rows from T as a whole.

Going back to your example, you're trying to emulate a user-defined aggregate function, where a particular instance of the UDF sees every row. The way to guarantee this would be to aggregate in advance, e.g.:

CREATE OR replace function MAP_COUNT(colValue array)
returns OBJECT 
language java
handler='Frequency.calculate'
target_path='@~/Frequency.jar'
as
$$
    import java.util.HashMap;
    import java.util.Map;
    class Frequency {
        public Map<String, Integer> calculate(String[] colValues) {
            // A single invocation receives the entire pre-aggregated array,
            // so every row's value is counted in one frequency map.
            Map<String, Integer> frequencies = new HashMap<>();
            for (String colValue : colValues) {
                frequencies.putIfAbsent(colValue, 0);
                frequencies.computeIfPresent(colValue, (key, value) -> value + 1);
            }
            return frequencies;
        }
    }
$$;

(Note that I changed the UDF and method signatures to use array and String[], respectively.) Now use it in a query:

with temp_1 as
(
    SELECT 'John' AS my_col, 27 as age
    UNION ALL
    SELECT 'John' AS my_col, 28 as age
    UNION ALL
    SELECT 'doe' AS my_col, 27 as age
    UNION ALL
    SELECT 'doe' AS my_col, 28 as age
)
select
  MAP_COUNT(ARRAY_AGG(a.my_col)) as names,
  MAP_COUNT(ARRAY_AGG(a.age)) as ages
from temp_1 a;

This gives me:

names                        ages
{ "John": "2", "doe": "2" }  { "27": "2", "28": "2" }

There are still two problems here, notably:

  • This doesn't scale very well. If the size of either array exceeds 16MB (the maximum value size), the query will fail.
  • The syntax is clunky. Ideally, you'd just use the UDF like any other aggregate function rather than having to wrap your inputs in ARRAY_AGG.

The good news is that both of these problems will be addressed once Java UDAFs are available at some point in the future.

Upvotes: 2
