Chris Phillips

Reputation: 12377

How do I trim a header row from files processed by Hadoop's Pig?

I am trying to parse tab-separated data files generated by our services using Amazon's Elastic MapReduce via a Pig program. Things are going well, except that all of our data files contain a header row that names each column. The (string) headers can't be cast to numeric data values, so I get warnings from Pig like the following:

2011-03-17 22:49:55,378 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.PigStorage: Unable to interpret value [<snip>] in field being converted to double, caught NumberFormatException <For input string: "headerName"> field discarded

I've got a filter after the load statement that tries to ensure that I don't later operate on any header lines (by filtering out header terms), but I'd like to get rid of the warning noise to avoid masking any potential problems (like actual data fields that don't cast properly).

Is this possible?

Upvotes: 5

Views: 3863

Answers (3)

Manish Agrawal

Reputation: 794

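One approach is to rank the rows and drop the first one. Note that this removes only the first row of the whole input, so it assumes a single file with a single header line: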

input_file = load 'input' using PigStorage(',') as (row1:chararray, row2:chararray);
ranked = rank input_file;
/* ranked:{rank_input_file:long, row1:chararray, row2:chararray} */
NoHeader = filter ranked by (rank_input_file > 1);
New_input_file = foreach NoHeader generate row1, row2;

Upvotes: 0

Dan

Reputation: 5231

Another option, if you're not comfortable with writing a UDF, could be something like this:

Sample data:

MyIntVal
123
456

Script:

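-- Load the column as chararray so the header row does not trigger a cast warning.
A = load 's3://blah/myFile' USING PigStorage() as (myintval: chararray);

-- Drop the header row by matching its literal text.
B = filter A by myintval neq 'MyIntVal';

-- Cast the remaining values to int.
C = foreach B generate (int)$0;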

This will filter the header row out, then cast your remaining values to int.

Not saying this is the best way to do it, but it's another option that is pretty simple if it fits your situation.

Upvotes: 3

wlk

Reputation: 5785

You can strip the header before submitting the Pig job (if that is possible in your setup), or write a UDF that emits null values when certain conditions are met, so that you can filter those rows out later.
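As a rough illustration of the first option (stripping the header before the job is submitted), here is a minimal Python sketch. The function name `strip_header` and the file names in the usage note are hypothetical, and it assumes each file is uncompressed and starts with exactly one header line unless told otherwise:

```python
import shutil


def strip_header(src_path, dst_path, header_lines=1):
    """Copy src_path to dst_path, skipping the first header_lines lines.

    Assumption: the file is plain (uncompressed) text with newline-terminated rows.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for _ in range(header_lines):
            src.readline()  # read and discard one header line
        shutil.copyfileobj(src, dst)  # stream the remaining rows unchanged
```

You would call something like `strip_header("report.tsv", "report_noheader.tsv")` on each file before uploading it, and point the Pig `load` at the cleaned copy. Working in binary mode leaves the tabs and line endings of the data rows untouched.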

Upvotes: 0
