Reputation: 386
I have a Python UDF that converts data from hex to string. When I call the UDF on multiple fields, I get an error. Here is my Python UDF; the script is hex_to_str.py:
#!/usr/bin/python
@outputSchema("field:chararray")
def hextoStr(field):
    if field != "":
        return field.decode("hex")
I call the UDF from my Pig script as follows:
register file:/home/myuser/myfolder/hex_to_str.py using jython as convert;
data = LOAD '/user/abc/hexfile' using PigStorage(',') as (Name:chararray, age:chararray);
hex = foreach data generate convert.hextoStr(Name),convert.hextoStr(age);
dump hex;
This is the error I am getting while running the script.
INFO org.apache.pig.scripting.jython.JythonFunction - Schema 'field:chararray' defined for func hextoStr
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1108:
<line 2, column 19> Duplicate schema alias: field
The error in the log file did not say much more:
<line 2, column 19> Duplicate schema alias: field
at org.apache.pig.newplan.logical.visitor.SchemaAliasVisitor.validate(SchemaAliasVisitor.java:75)
at org.apache.pig.newplan.logical.visitor.SchemaAliasVisitor.visit(SchemaAliasVisitor.java:105)
at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:246)
However, if I run the same script on only one field, it works:
register file:/home/myuser/myfolder/hex_to_str.py using jython as convert;
data = LOAD '/user/abc/hexfile' using PigStorage(',') as (Name:chararray, age:chararray);
hex = foreach data generate Name,convert.hextoStr(age);
dump hex;
Upvotes: 0
Views: 576
Reputation: 15296
I suspect this is because the @outputSchema("field:chararray") decorator specifies the default name (alias) and datatype of the UDF's output. When you call the UDF twice, you're using the same alias twice in the GENERATE, and so the "Duplicate schema alias: field" error results.
You could run two separate GENERATEs, but I suspect you'll be able to use the function twice if you re-alias, e.g. something along these lines:
hex = foreach data generate convert.hextoStr(Name) as field1,convert.hextoStr(age) as field2;
Then each result will have its own alias, and that error should go away. Without re-aliasing, there wouldn't be a way for Pig to differentiate which result you were referring to elsewhere in the GENERATE statement.
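To make the failure mode concrete, here is a toy Python model of the alias check. It is only an illustration, assuming Pig rejects a GENERATE whose output columns share an alias; check_aliases and the alias lists are hypothetical and not part of Pig.

```python
from collections import Counter


def check_aliases(aliases):
    """Toy stand-in for Pig's duplicate-alias validation (hypothetical helper)."""
    counts = Counter(aliases)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError("Duplicate schema alias: " + ", ".join(duplicates))
    return list(aliases)


# Both UDF calls inherit the alias "field" from @outputSchema, so the check fails.
try:
    check_aliases(["field", "field"])
except ValueError as err:
    first_error = str(err)

# Re-aliasing with AS gives each column a distinct name, so the check passes.
renamed = check_aliases(["field1", "field2"])
```

The one-field case from the question passes for the same reason: the aliases there are Name and field, which are already distinct.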
Response to comment from OP:
I suspect you could replace field in the decorator with whatever specific string you'd like, but you would still have the issue of calling the UDF twice on two different fields using the same alias, so you would still need to re-alias. I don't think it is possible to use a variable within the decorator, but re-aliasing back in your Pig script keeps things fully dynamic, e.g. you could alias the results as age and name, or similar, to match the actual field names.
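For reference, a slightly more defensive version of the UDF might look like the sketch below. Several things here are my assumptions rather than anything from the question: the alias decoded is arbitrary, the fallback outputSchema definition exists only so the file can be exercised outside Pig, and the decoded bytes are assumed to be UTF-8 text. Re-aliasing with AS in the Pig script is still needed when the function is called twice.

```python
import binascii

try:
    outputSchema  # expected to be supplied by Pig when registered "using jython"
except NameError:
    # Assumption: a no-op stand-in so the module can be tested outside Pig.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("decoded:chararray")
def hextoStr(field):
    # Return None (a Pig null) for missing or empty input.
    if not field:
        return None
    try:
        # Assumption: the hex string encodes UTF-8 text.
        return binascii.unhexlify(field).decode("utf-8")
    except (TypeError, ValueError):
        # Odd-length or non-hex input, or bytes that are not valid UTF-8.
        return None
```

Returning None for bad input is one design choice among several; raising instead would surface malformed rows rather than silently turning them into nulls.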
The documentation on Python UDFs has more details and says, in part:
"Sample Schema String - y:{t:(word:chararray,num:long)}, variable names inside a schema string are not used anywhere, they just make the syntax identifiable to the parser."
Upvotes: 1