user182944

Reputation: 8067

Unable to display data using Pig FOREACH

I have this sample dataset in a text file (format: firstname,lastname,age,sex):

(Eric,Ack,27,M)
(Jenny,Dicken,27,F)
(Angs,Dicken,28,M)
(Mahima,Mohanty,29,F)

I want to display the firstname and age of employees whose age is greater than 27. I have made some progress but am now stuck, and I am looking for some pointers.

I am loading this dataset using:

tuple_record = LOAD '~/Documents/Pig_Tuple.txt' AS (details:tuple(firstname:chararray,lastname:chararray,age:int,sex:chararray));

Describing gives me this format:

describe tuple_record
tuple_record: {details: (firstname: chararray,lastname: chararray,age: int,sex: chararray)}

Then I am flattening the record using this:

flatten_tuple_record = FOREACH tuple_record GENERATE FLATTEN(details);

Describing the flattening gives me this:

describe flatten_tuple_record
flatten_tuple_record: {details::firstname: chararray,details::lastname: chararray,details::age: int,details::sex: chararray}

Now I want to filter this based on age:

filter_by_age = FILTER flatten_tuple_record BY age > 27;

Then I am doing a group based on age:

group_by_age = GROUP filter_by_age BY age;

To display the firstname and age, I tried this, but it did not work:

display_details = FOREACH group_by_age GENERATE group,firstname;

Below is the error message:

2015-02-01 08:39:37,752 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: 
<line 5, column 54> Invalid field projection. Projected field [firstname] does not exist in schema: group:int,filter_by_age:bag{:tuple(details::firstname:chararray,details::lastname:chararray,details::age:int,details::sex:chararray)}

Please guide.

Upvotes: 0

Views: 1847

Answers (1)

Prasad Khode

Reputation: 6739

Your Pig statements look fine up to the filter, but the GROUP is not needed: after filtering the data by age, you can project firstname and age directly. Use the following statements:

tuple_record = LOAD '/user/cloudera/Pig_Tuple.txt' AS (details:tuple(firstname:chararray,lastname:chararray,age:int,sex:chararray));

describe tuple_record;

flatten_tuple_record = FOREACH tuple_record GENERATE FLATTEN(details);

describe flatten_tuple_record;

filter_by_age = FILTER flatten_tuple_record BY age > 27;

details = FOREACH filter_by_age GENERATE firstname, age;

dump details;

Update:

We can even skip the FLATTEN statement and reference the tuple's fields directly:

tuple_record = LOAD '/user/cloudera/Pig_Tuple.txt' AS (details:tuple(firstname:chararray,lastname:chararray,age:int,sex:chararray));

describe tuple_record;

filter_by_age = FILTER tuple_record BY details.age > 27;

details = FOREACH filter_by_age GENERATE details.firstname, details.age;

dump details;

In both cases, the result will be:

(Angs,28)
(Mahima,29)
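
As for why the original attempt failed: after GROUP, each record has only two fields, group (here, the age) and a bag named after the grouped relation (filter_by_age) that holds the matching rows. That is the schema shown in the error message, so firstname does not exist at the top level and has to be projected out of the bag. A minimal sketch, assuming the relation names from the question and that grouping by age is actually wanted (not run against your file):

```pig
-- Same grouping as in the question.
group_by_age = GROUP filter_by_age BY age;

-- After GROUP the original fields live inside the bag called filter_by_age,
-- so project firstname from that bag instead of from the top level.
-- If the short name is ever ambiguous, use the qualified name details::firstname.
display_details = FOREACH group_by_age GENERATE group, filter_by_age.firstname;

dump display_details;
```

This should give one record per age, with the first names for that age collected in a bag. If you want one row per person, as in the result above, the ungrouped versions are simpler.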

Upvotes: 3
