Reputation: 99
Currently I have a simple Pig script which reads a file from a Hadoop file system:
my_input = load 'input_file' as (A, B, C)
I then have another line of code which needs to manipulate the fields, for instance to convert them to uppercase (as in the Pig UDF tutorial).
I do something like:
manipulated = FOREACH my_input GENERATE myudf.Upper(A, B, C)
Now, in my Upper.java file, I know that I can get the values of A, B and C as follows (assuming they are all Strings):
public String exec(Tuple input) throws IOException
{
//yada yada yada
....
String A = (String) input.get(0);
String B = (String) input.get(1);
String C = (String) input.get(2);
//yada yada yada
....
}
Is there any way I can get the value of a field by its name? For instance, if I need to get 10 fields, is there no other way than to call input.get(i) for i from 0 to 9?
I am new to Pig, so I am interested in knowing why this is the case. Is there something like tuple.getByFieldName('Field Name')?
Upvotes: 1
Views: 2539
Reputation: 61
While I agree that function flexibility would be affected if you use field names, it is technically possible to access fields by name.
The trick is to use the input schema, available through getInputSchema(), and get the mapping between field indexes and names from there. You can also override outputSchema and build the mapping there, using its inputSchema parameter. Then you would be able to use this mapping in your exec method.
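The mapping described above can be sketched without any Pig dependency. This is a minimal illustration, not the actual UDF: the class name FieldIndex is hypothetical, and plain Java lists stand in for Pig's schema and tuple. In a real UDF the aliases would presumably be read from the schema returned by getInputSchema(), which may be missing if the script declares no field names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps field names to positions, then reads by name.
// Plain lists stand in for Pig's Schema and Tuple in this sketch.
class FieldIndex {
    private final Map<String, Integer> positions = new HashMap<>();

    // aliases: field names in positional order (assumed to come from the input schema).
    FieldIndex(List<String> aliases) {
        for (int i = 0; i < aliases.size(); i++) {
            positions.put(aliases.get(i), i);
        }
    }

    // Returns the value stored at the position registered for the given name.
    Object get(List<Object> tuple, String name) {
        Integer pos = positions.get(name);
        if (pos == null) {
            throw new IllegalArgumentException("Unknown field: " + name);
        }
        return tuple.get(pos);
    }
}
```

Building the index once (for example when the schema first becomes available) and reusing it in exec avoids repeating the name lookup setup for every tuple.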
Upvotes: 2
Reputation: 5801
This is not possible, nor would it be very good design to allow it. Pig field names are like variable names. They allow you to give a memorable name to something that gives you insight into what it means. If you use those names in your UDF, you are forcing every Pig script which uses the UDF to adhere to the same naming scheme. If you decide later that you want to think of your variables a little differently, you can't reflect that in their names because the UDF would not function anymore.
The code that reads data from the input tuple in your UDF is like a function declaration. It establishes how to treat each argument to the function.
If you really want to be able to do this, you can build a map easily enough using the TOMAP builtin function, and have your UDF read from the map. This greatly hurts the reusability of your UDF for the reasons mentioned above, but it is nevertheless a fairly simple workaround.
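As a rough sketch of the map workaround: the script would pass something like TOMAP('A', A, 'B', B, 'C', C) to the UDF, and the UDF would then read values by key. The code below is a Pig-free illustration of that read side only. The class name MapUpper and the method upperAll are hypothetical, and it assumes the map reaches Java as a Map<String, Object>, which the UDF would obtain by casting input.get(0).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the body of a map-based UDF.
class MapUpper {
    // Upper-cases every value, keyed by field name; null values stay null.
    static Map<String, String> upperAll(Map<String, Object> fields) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            Object v = e.getValue();
            out.put(e.getKey(), v == null ? null : v.toString().toUpperCase());
        }
        return out;
    }
}
```

The key names are now part of the contract between script and UDF, which is exactly the coupling the answer warns about.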
Upvotes: 4
Reputation: 256
I don't think you can access a field by name. You would need a map-like structure to achieve that. In Pig, even though you cannot access fields by name, you can still rely on position, provided the schema of the input (load) is properly defined and consistent.
The most you can do is validate the types of the fields your UDF receives.
On the other hand, you can implement outputSchema in your UDF to publish its output by name. See the UDF Manual.
Upvotes: 1