Not able to filter data using Apache Pig

Question

I am using Hadoop 1.0.3 and Pig 0.11.0 on Ubuntu 12.04. The part-m-00000 file in HDFS has the following content:

training@BigDataVM:~/Installations/hadoop-1.0.3$ bin/hadoop fs -cat /user/training/user/part-m-00000
1,Praveen,20,India,M
2,Prajval,5,India,M
3,Prathibha,15,India,F

I am loading it into a bag and then filtering it as below.

Users1 = load '/user/training/user/part-m-00000' as (user_id, name, age:int, country, gender);
Fltrd = filter Users1 by age <= 16;

When I dump Users1, 5 records are shown in the console, but dumping Fltrd returns no records.

dump Fltrd;

The following warning is shown in the Pig console:

2013-02-24 16:19:40,735 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 12 time(s).

It looks like I have made some simple mistake, but I couldn't figure out what it is. Please help me with this.

Lorand Bendig · Accepted Answer

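Since you haven't specified a load function, Pig uses PigStorage, whose default delimiter is the tab character ('\t'). Your file is comma-separated, so each line is loaded as a single field; age is then null for every record, and the filter drops them all.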

If part-m-00000 is a text file, try setting the delimiter to ',':

Users1 = load '/user/training/user/part-m-00000' using PigStorage(',') 
  as (user_id, name, age:int, country, gender);

If it's a SequenceFile then have a look at Dolan's or my answer on this question.
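
For the text file case, a quick way to confirm that the delimiter was the problem is to reload with the comma delimiter, inspect the schema, and rerun the filter. This is only a minimal sketch: it assumes the file is plain comma-separated text, as the cat output in the question suggests, and the explicit types on fields other than age are my own addition rather than something from the question.

```pig
-- Assumption: part-m-00000 is plain comma-separated text.
-- The types other than age:int are illustrative, not from the question.
Users1 = load '/user/training/user/part-m-00000' using PigStorage(',')
  as (user_id:int, name:chararray, age:int, country:chararray, gender:chararray);

-- Print the schema to check that five separate fields are declared.
describe Users1;

-- Same filter as in the question; age should now be populated.
Fltrd = filter Users1 by age <= 16;
dump Fltrd;
```

For the three rows shown in the question, the filter should now keep the records with user_id 2 and 3. If ACCESSING_NON_EXISTENT_FIELD still appears, the file is probably not comma-separated text after all.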
