Reputation: 145
I have created a GenericUDF in hive that takes one string argument and returns an array of two strings, something like:
> select normalise("ABC-123");
...
> [ "abc-123", "abc123" ]
The UDF calls out via JNI to a C++ program for each row to calculate the return data, so for performance reasons I would prefer to make that call only once per input row.
However, I want to be able to take each value from the array and put it into a separate field in the output table. I know I can do:
> select normalise("ABC-123")[0] as first_string, normalise("ABC-123")[1] as second_string;
Will hive call the normalise function twice - once for each time it is used in this statement - or will it see both calls have the same argument and only call it once, cache the output, and use the cache rather than making the call a second time?
If it is going to make two UDF calls per row, what other options are there to use this UDF and put the two strings from the output array into separate columns in an output table? (I don't think INLINE will work here)
The use case for this function will be something like:
a|b
1|ABC-123
2|DEF-456
select a, normalise(b)[0] as first_string, normalise(b)[1] as second_string;
Upvotes: 0
Views: 362
Reputation: 36545
If you want to make sure that the UDF is only called once per row, you could save the results to a temporary table first:
create table tmp as
select a, normalise(b) as arr
from mytable;
select a, arr[0] first_string, arr[1] second_string
from tmp;
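If you would rather not keep an intermediate table around, roughly the same shape can be written as one statement with a subquery. Treat this as a sketch rather than a guarantee: `mytable` is the same placeholder table name as above, and whether Hive really evaluates the UDF once per row here depends on the version and on what the optimizer does with the subquery, so check the plan with `explain` before relying on it. As far as I know, Hive 0.12 and later also has a `hive.cache.expr.evaluation` setting that caches a deterministic expression referenced more than once in a select, which would only help if the UDF is declared deterministic.

```sql
-- Sketch only: assumes the UDF is registered as `normalise`
-- and that the input rows live in a table called `mytable`.

-- Believed to be on by default; set explicitly to make the assumption visible.
-- Only applies to UDFs that are marked as deterministic.
set hive.cache.expr.evaluation=true;

select a,
       arr[0] as first_string,
       arr[1] as second_string
from (
  -- The UDF appears once here; the outer query only indexes its result.
  select a, normalise(b) as arr
  from mytable
) t;
```

Unlike the temporary table, this does not materialise the array anywhere, so the single call is not something the query text alone can promise.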
That said, I probably wouldn't worry about this kind of performance tuning if I were you. In my opinion, Hive is best approached with more of a "brute force" state of mind: just write the simplest code that achieves your task, and if it's slow, you can always add more nodes to your cluster.
Also, it might be worth considering whether you really need a custom UDF for your task, or whether you can simplify your codebase by using inbuilt Hive functions; in the example you gave:
select lower(b) as first_string, regexp_replace(lower(b), '-', '') as second_string
from mytable;
Upvotes: 1