Pig Optimization on Group by

Question

Let's assume that I have a large data set with the following schema:

id,name,city
---------------
100,Ajay,Chennai
101,John,Bangalore
102,Zach,Chennai
103,Deep,Bangalore
....
...

I have two styles of Pig code that give me the same output.

Style 1 :

records = load 'user/inputfiles/records.txt' Using PigStorage(',') as (id:int,name:chararray,city:chararray);
records_grp = group records by city;
records_each = foreach records_grp generate group as city,COUNT(records.id) as emp_cnt;
dump records_each;

Style 2 :

records = load 'user/inputfiles/records.txt' Using PigStorage(',') as (id:int,name:chararray,city:chararray);
records_each = foreach (group records by city) generate group as city,COUNT(records.id) as emp_cnt;
dump records_each ;

In the second style I nested the group inside the foreach. Does the Style 2 code run faster than the Style 1 code?

I would like to reduce the total time taken to complete this Pig job.

Does the Style 2 code achieve that, or is there no impact on the total time taken?

If somebody can confirm this, I can run similar code on my cluster with a very large dataset.

glefait · Accepted Answer

Both versions have the same cost.

However, if records_grp is not used elsewhere, Style 2 saves you from declaring an extra alias and makes your script shorter.
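
If you want to check this on your own cluster, one option is Pig's explain operator, which prints the logical, physical and MapReduce plans for an alias without running the job. The sketch below reuses the path and aliases from the question as they are; the only addition is the final explain line.

```pig
-- Style 1: the grouped relation gets its own alias
records = load 'user/inputfiles/records.txt' using PigStorage(',') as (id:int, name:chararray, city:chararray);
records_grp = group records by city;
records_each = foreach records_grp generate group as city, COUNT(records.id) as emp_cnt;

-- Print the execution plans for records_each instead of dumping the result
explain records_each;
```

Running the same explain in a second script that uses the Style 2 foreach line should show an equivalent plan, which is a quick way to confirm the two styles before launching the job on the full dataset.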
