DebD

Reputation: 386

Python UDF for PIG Giving error

I have a Python UDF that converts data from hex to a string. When I call the UDF on multiple fields, I get an error. Here is my Python UDF; the script is hex_to_str.py.

#!/usr/bin/python

@outputSchema("field:chararray")
def hextoStr(field):
    # Decode a hex-encoded string; empty input falls through and returns None.
    if field != "":
        return field.decode("hex")

I call the UDF from my Pig script as follows.

register file:/home/myuser/myfolder/hex_to_str.py using jython as convert;
data = LOAD '/user/abc/hexfile' using PigStorage(',') as (Name:chararray, age:chararray);
hex = foreach data generate convert.hextoStr(Name),convert.hextoStr(age);
dump hex;

This is the error I get when running the script.

INFO  org.apache.pig.scripting.jython.JythonFunction - Schema 'field:chararray' defined for func hextoStr
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1108:
 <line 2, column 19> Duplicate schema alias: field

The log file did not say much more either.

<line 2, column 19> Duplicate schema alias: field
        at org.apache.pig.newplan.logical.visitor.SchemaAliasVisitor.validate(SchemaAliasVisitor.java:75)
        at org.apache.pig.newplan.logical.visitor.SchemaAliasVisitor.visit(SchemaAliasVisitor.java:105)
        at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:246)

But if I run the same script on only one field, it works.

register file:/home/myuser/myfolder/hex_to_str.py using jython as convert;
data = LOAD '/user/abc/hexfile' using PigStorage(',') as (Name:chararray, age:chararray);
hex = foreach data generate Name,convert.hextoStr(age);
dump hex;

Upvotes: 0

Views: 576

Answers (1)

khampson

Reputation: 15296

I suspect this is because the @outputSchema("field:chararray") decorator specifies the default name (alias) and datatype of the UDF's output. When you call the UDF twice, you're using the same alias twice in the GENERATE, which results in the Duplicate schema alias: field error.

You could run two separate GENERATEs, but I suspect you'll be able to use the function twice if you re-alias.

For example, something along these lines:

hex = foreach data generate convert.hextoStr(Name) as field1,convert.hextoStr(age) as field2;

Then each result will have its own alias, and that error should go away. Without re-aliasing, there wouldn't be a way for Pig to differentiate which result you were referring to elsewhere in the GENERATE statement.
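To make the reasoning concrete, here is a toy model in Python of the kind of check that fails. This is not Pig's actual SchemaAliasVisitor code; the function names and the (default_alias, explicit_alias) representation are made up purely for illustration.

```python
from collections import Counter


def output_aliases(expressions):
    """Return the alias each GENERATE expression ends up with.

    Each expression is a (default_alias, explicit_alias) pair. An explicit
    `as` alias, when present, replaces the default one.
    """
    return [explicit or default for default, explicit in expressions]


def duplicate_aliases(expressions):
    """Return the aliases that appear more than once, sorted."""
    counts = Counter(output_aliases(expressions))
    return sorted(alias for alias, n in counts.items() if n > 1)


# Both calls inherit the alias "field" from @outputSchema("field:chararray").
failing = [("field", None), ("field", None)]
# One plain column plus one UDF call: aliases "Name" and "field".
single = [("Name", None), ("field", None)]
# Re-aliased with `as field1` / `as field2`.
realiased = [("field", "field1"), ("field", "field2")]

print(duplicate_aliases(failing))    # ['field']
print(duplicate_aliases(single))     # []
print(duplicate_aliases(realiased))  # []
```

Only the first case has a clash, which matches what you saw: the two-call script fails, while the one-call script works.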

Response to comment from OP:

I suspect you could replace field in the decorator with whatever specific string you'd like, but you would still have the issue of calling it twice on two different fields using the same alias, so you would still need to re-alias. I don't think it is possible to use a variable within the decorator, but re-aliasing back in your Pig script gives you full flexibility; e.g., you could alias them as age and name or similar to match the actual field names.
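If you would rather keep the aliases in the Python file, one option is a thin wrapper per field, each with its own schema string. This is an untested sketch: hexToName, hexToAge, and _hex_to_str are names I made up, binascii.unhexlify is swapped in for field.decode("hex"), and the fallback outputSchema exists only so the file can be imported outside Pig (Pig provides the real decorator when it registers a jython script).

```python
import binascii

# Assumption: outside Pig there is no outputSchema, so define a stand-in
# that just records the schema string on the function.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


def _hex_to_str(value):
    # Hypothetical shared helper: empty or None input is passed through as-is.
    if not value:
        return value
    return binascii.unhexlify(value).decode("utf-8")


@outputSchema("name:chararray")
def hexToName(value):
    return _hex_to_str(value)


@outputSchema("age:chararray")
def hexToAge(value):
    return _hex_to_str(value)
```

I'd expect convert.hexToName(Name) and convert.hexToAge(age) to then coexist in one GENERATE without an explicit as, since their default aliases differ, but re-aliasing in the Pig script is still the simpler fix.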

There are more details in the documentation on Python UDFs, which says, in part:

Sample Schema String - y:{t:(word:chararray,num:long)}, variable names inside a schema string are not used anywhere, they just make the syntax identifiable to the parser.

Upvotes: 1
