Reputation: 153
Java UDFs return a scalar result. Java UDTFs are not currently supported.
reference
That said, I created a Java UDF as given below
CREATE OR replace function MAP_COUNT(colValue String)
returns OBJECT
language java
handler='Frequency.calculate'
target_path='@~/Frequency.jar'
as
$$
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
class Frequency {
Map<String, Integer> frequencies = new HashMap<>();
public Map<String, Integer> calculate(String colValue) {
frequencies.putIfAbsent(colValue, 0);
frequencies.computeIfPresent(colValue, (key, value) -> value + 1);
return frequencies;
}
}
$$;
Using MAP_COUNT
UDF in a query as below
with temp_1 as
(
SELECT 'John' AS my_col, 27 as age
UNION ALL
SELECT 'John' AS my_col, 28 as age
UNION ALL
SELECT 'doe' AS my_col, 27 as age
UNION ALL
SELECT 'doe' AS my_col, 28 as age
)
select MAP_COUNT(a.my_col) from temp_1 a;
I get result as below
|MAP_COUNT(A.MY_COL) |
|-------------------------------|
|{ "John": "1" } |
|{ "John": "2" } |
|{ "John": "2", "doe": "1" } |
|{ "John": "2", "doe": "2"} |
The result I expect from my UDF is as below
|MAP_COUNT(A.MY_COL) |
|-------------------------------|
|{ "John": "2", "doe": "2"} |
Is it possible in snowflake?
What if I have query like below?
with temp_1 as
(
SELECT 'John' AS my_col, 27 as age
UNION ALL
SELECT 'John' AS my_col, 28 as age
UNION ALL
SELECT 'doe' AS my_col, 27 as age
UNION ALL
SELECT 'doe' AS my_col, 28 as age
)
select MAP_COUNT(a.my_col) as names, MAP_COUNT(a.age) as ages from temp_1 a;
The result I expect from my UDF is as below
|names ||AGES |
|-------------------------------||-------------------------------|
|{ "John": "2", "doe": "2"} ||{ "27": "2", "28": "2"} |
There are ways to achieve this by simply restructuring query but I want to know if it is possible to do by using MAP_COUNT
function similar to OBJECT_AGG
function in select clause.
Upvotes: 2
Views: 1269
Reputation: 1384
You can achieve much of this desired functionality using Javascript UDTF's today. In particular, a UDTF can be configured to only return a (aggregated) result at the end of a "grouping". Here is an example:
CREATE OR REPLACE FUNCTION MAP_COUNT(COLVALUE varchar)
RETURNS TABLE (FREQUENCIES variant)
LANGUAGE JAVASCRIPT
AS '
{
initialize: function (argumentInfo, context) {
this.freq = new Map();
},
processRow: function (row, rowWriter, context) {
freqVal = this.freq[row.COLVALUE];
this.freq[row.COLVALUE] = (freqVal == undefined ? 1 : 1 + freqVal);
},
finalize: function (rowWriter, context) {
rowWriter.writeRow({FREQUENCIES: this.freq});
}
}
'
;
Create a temp table to test this on:
create or replace temporary table mytemp as
with temp_1 as
(
SELECT 'John' AS my_col, 27 as age
UNION ALL
SELECT 'John' AS my_col, 28 as age
UNION ALL
SELECT 'doe' AS my_col, 27 as age
UNION ALL
SELECT 'doe' AS my_col, 28 as age
)
select * from temp_1;
Run the UDTF as a single partition result:
select agg.* from myTemp,
table(map_count(my_col)) agg
;
Result: { "John": 2, "doe": 2 }
Run the UDTF partitioned by separate groupings:
select my_col, agg.* from myTemp,
table(map_count(age::varchar) over (partition by my_col)) agg
;
Result:
doe { "27": 1, "28": 1 }
John { "27": 1, "28": 1 }
Upvotes: 0
Reputation: 33745
When you run a query that uses a UDF, not all rows will necessarily go to the same instance of the UDF. For example, let's say that you're selecting from a table, and you do:
SELECT MyUdf(x) FROM T
Here T
may have multiple micro-partitions, and the way that it executes is actually similar to:
SELECT MyUdf(x) FROM T_part1 UNION ALL
SELECT MyUdf(x) FROM T_part2 UNION ALL
SELECT MyUdf(x) FROM T_part3 UNION ALL
SELECT MyUdf(x) FROM T_part4
Here there are four separate instances of MyUdf
, and each one sees just a subset of the rows from T
as a whole.
Going back to your example, you're trying to emulate a user-defined aggregate function, where a particular instance of the UDF sees every row. The way to guarantee this would be to aggregate in advance, e.g.:
CREATE OR replace function MAP_COUNT(colValue array)
returns OBJECT
language java
handler='Frequency.calculate'
target_path='@~/Frequency.jar'
as
$$
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
class Frequency {
public Map<String, Integer> calculate(String[] colValues) {
Map<String, Integer> frequencies = new HashMap<>();
for (String colValue : colValues) {
frequencies.putIfAbsent(colValue, 0);
frequencies.computeIfPresent(colValue, (key, value) -> value + 1);
}
return frequencies;
}
}
$$;
(Note that I changed the UDF and method signatures to use array
and String[]
, respectively.) Now use it in a query:
with temp_1 as
(
SELECT 'John' AS my_col, 27 as age
UNION ALL
SELECT 'John' AS my_col, 28 as age
UNION ALL
SELECT 'doe' AS my_col, 27 as age
UNION ALL
SELECT 'doe' AS my_col, 28 as age
)
select
MAP_COUNT(ARRAY_AGG(a.my_col)) as names,
MAP_COUNT(ARRAY_AGG(a.age)) as ages
from temp_1 a;
This gives me:
names ages
{ "John": "2", "doe": "2" } { "27": "2", "28": "2" }
There are still two problems here, notably:
ARRAY_AGG
.The good news is that both of these problems will be addressed once Java UDAFs are available at some point in the future.
Upvotes: 2