user2597504

Reputation: 1533

Hadoop Pig Filter

I have an input file like this:

481295b2-30c7-4191-8c14-4e513c7e7577,1362974399,56973118825,56950298471,true
67912962-dd84-46fa-84ef-a2fba12c2423,1362974399,56950556676,56982431507,false
cc68e779-4798-405b-8596-c34dfb9b66da,1362974399,56999223677,56998032823,true
37a1cc9b-8846-4cba-91dd-19e85edbab00,1362974399,56954667454,56981867544,false
4116c384-3693-4909-a8cc-19090d418aa5,1362974399,56986027804,56978169216,true

I only need the lines whose last field is "true", so I use the following Pig Latin:

records = LOAD 'test/test.csv' USING PigStorage(',');
A = FILTER records BY $4 'true';
DUMP A;

The problem is the second statement; I always get this error:

2013-08-07 16:48:11,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 25>  mismatched input ''true'' expecting SEMI_COLON

Why? I also tried "$4 == 'true'", but it still doesn't work. Could anyone tell me how to do this simple thing?

Upvotes: 1

Views: 4430

Answers (1)

mr2ert

Reputation: 5184

How about:

A = FILTER records BY $4 == 'true' ;

Also, if you know how many fields the data will have beforehand, you should give it a schema. Something like:

-- val3 and val4 are declared long: sample values such as 56973118825 do not fit in an int
records = LOAD 'test/test.csv' USING PigStorage(',')
          AS (val1: chararray, val2: int, val3: long, val4: long, bool: chararray);

Or whatever names/types fit your needs.
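Putting the two suggestions together, a minimal sketch of the whole script might look like the following. The field names (id, ts, caller, callee, flag) are hypothetical and chosen only for illustration; the question does not say what the columns mean. The two 11-digit columns are declared as long on the assumption that they should be kept as numbers, since values like 56973118825 are larger than the 32-bit int maximum of 2147483647.

```pig
-- Load the CSV with an explicit schema.
-- All field names below are illustrative assumptions, not taken from the question.
records = LOAD 'test/test.csv' USING PigStorage(',')
          AS (id: chararray, ts: long, caller: long, callee: long, flag: chararray);

-- Keep only the rows whose last field is the string 'true'.
-- With a schema in place, the field can be referenced by name instead of $4.
A = FILTER records BY flag == 'true';

DUMP A;
```

With the five sample lines from the question, this should keep the three rows that end in true. Note that the last column is loaded as a chararray here, so the comparison is against the string 'true' rather than a boolean value.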

Upvotes: 3
