arv100kri

Reputation: 99

Get fields by name in Pig?

Currently I have a simple Pig script that reads from a file on a Hadoop filesystem:

my_input = load 'input_file' as (A, B, C);

Then I have another line that manipulates the fields, for instance converting them to uppercase (as in the Pig UDF tutorial).

I do something like:

manipulated = FOREACH my_input GENERATE myudf.Upper(A, B, C);

In my Upper.java file, I know I can get the values of A, B, and C like this (assuming they are all Strings):

    public String exec(Tuple input) throws IOException
    {
        //yada yada yada
        ....
        String A = (String) input.get(0);
        String B = (String) input.get(1);
        String C = (String) input.get(2);
        //yada yada yada
        ....
    }

Is there any way to get the value of a field by its name? For instance, if I need 10 fields, is there no alternative to calling input.get(i) for i from 0 to 9?

I am new to Pig, so I am interested in knowing why this is the case. Is there something like a tuple.getByFieldName('Field Name')?

Upvotes: 1

Views: 2539

Answers (3)

AndreyK

Reputation: 61

While I agree that function flexibility would be affected if you use field names, it is technically possible to access fields by name.

The trick is to use the input schema, available through getInputSchema(), and take the mapping between field indexes and names from it. You can also override outputSchema and build the mapping there from its inputSchema parameter. You can then use this mapping in your exec method.
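To make the idea concrete, here is a minimal sketch of the name-to-index mapping. It deliberately avoids the Pig classes so it runs on its own: a List<String> of aliases stands in for the input schema and a List<Object> stands in for the Tuple. The names FieldIndexSketch, buildIndex and getByName are made up for illustration. In a real UDF you would fill the alias list from the schema returned by getInputSchema(), and that wiring is not shown here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FieldIndexSketch {

    // Build an alias -> position map from the ordered field aliases.
    // Assumption: the aliases arrive in the same order as the tuple's fields.
    static Map<String, Integer> buildIndex(List<String> aliases) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < aliases.size(); i++) {
            index.put(aliases.get(i), i);
        }
        return index;
    }

    // Look a value up by field name; the List<Object> is a stand-in for a Pig Tuple.
    static Object getByName(List<Object> tuple, Map<String, Integer> index, String name) {
        Integer position = index.get(name);
        if (position == null) {
            throw new IllegalArgumentException("Unknown field: " + name);
        }
        return tuple.get(position);
    }

    public static void main(String[] args) {
        Map<String, Integer> index = buildIndex(Arrays.asList("A", "B", "C"));
        List<Object> tuple = Arrays.<Object>asList("a", "b", "c");
        System.out.println(getByName(tuple, index, "B"));
    }
}
```

Building the map once and reusing it for every tuple keeps the per-row cost to a single hash lookup, but the UDF still only works with scripts that use the aliases it expects.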

Upvotes: 2

reo katoa

Reputation: 5801

This is not possible, nor would it be very good design to allow it. Pig field names are like variable names. They allow you to give a memorable name to something that gives you insight into what it means. If you use those names in your UDF, you are forcing every Pig script which uses the UDF to adhere to the same naming scheme. If you decide later that you want to think of your variables a little differently, you can't reflect that in their names because the UDF would not function anymore.

The code that reads data from the input tuple in your UDF is like a function declaration. It establishes how to treat each argument to the function.

If you really want to be able to do this, you can build a map easily enough using the TOMAP builtin function, and have your UDF read from the map. This greatly hurts the reusability of your UDF for the reasons mentioned above, but it is nevertheless a fairly simple workaround.
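As a rough sketch of that workaround: on the Pig side the call would look something like myudf.Upper(TOMAP('A', A, 'B', B, 'C', C)), so the UDF receives a single map argument keyed by whatever names the script chose. The Java below only shows the map-reading part, using plain java.util types so it runs without Pig. MapArgumentSketch and upperFields are hypothetical names, and joining the results with commas is an arbitrary choice made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class MapArgumentSketch {

    // Read the requested fields from the map by name and upper-case them.
    // Fields that are missing from the map (or null) are skipped.
    static String upperFields(Map<String, Object> fields, String... names) {
        StringBuilder out = new StringBuilder();
        for (String name : names) {
            Object value = fields.get(name);
            if (value == null) {
                continue;
            }
            if (out.length() > 0) {
                out.append(',');
            }
            out.append(value.toString().toUpperCase(Locale.ROOT));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the map that TOMAP would hand to the UDF.
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("A", "foo");
        fields.put("B", "bar");
        fields.put("C", "baz");
        System.out.println(upperFields(fields, "A", "C"));
    }
}
```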

Upvotes: 4

satish

Reputation: 256

I don't think you can access a field by name. You need a map-like structure to achieve that. In Pig, even though you cannot access fields by name, you can still rely on position as long as the schema of the input (load) is properly defined and consistent.

The most you can do is validate the types of the fields you are ingesting in the UDF.
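A minimal sketch of such a check, assuming positional access and String fields as in the question. A List<Object> stands in for the Pig Tuple so the snippet runs on its own, and requireString is a hypothetical helper, not part of Pig. It throws an unchecked exception to keep the example short; inside exec you might prefer to throw IOException instead.

```java
import java.util.Arrays;
import java.util.List;

public class FieldTypeCheckSketch {

    // Return the value at the given position as a String.
    // Throws if the position is out of range or the value is not a String
    // (a null value also fails the instanceof check).
    static String requireString(List<Object> tuple, int position) {
        if (position < 0 || position >= tuple.size()) {
            throw new IllegalArgumentException("No field at position " + position);
        }
        Object value = tuple.get(position);
        if (!(value instanceof String)) {
            throw new IllegalArgumentException(
                "Field " + position + " is not a String: " + value);
        }
        return (String) value;
    }

    public static void main(String[] args) {
        // Stand-in for a tuple whose second field has an unexpected type.
        List<Object> tuple = Arrays.<Object>asList("a", 42, "c");
        System.out.println(requireString(tuple, 0));
    }
}
```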

On the other hand, you can implement outputSchema in your UDF to publish its output by name. See the UDF Manual.

Upvotes: 1
