Reputation: 3599
Let's assume that I have a large data set with the schema layout below:
id,name,city
---------------
100,Ajay,Chennai
101,John,Bangalore
102,Zach,Chennai
103,Deep,Bangalore
....
...
I have two styles of Pig code that give me the same output.
Style 1 :
records = load 'user/inputfiles/records.txt' Using PigStorage(',') as (id:int,name:chararray,city:chararray);
records_grp = group records by city;
records_each = foreach records_grp generate group as city,COUNT(records.id) as emp_cnt;
dump records_each;
Style 2 :
records = load 'user/inputfiles/records.txt' Using PigStorage(',') as (id:int,name:chararray,city:chararray);
records_each = foreach (group records by city) generate group as city,COUNT(records.id) as emp_cnt;
dump records_each ;
In the second style I used a nested foreach. Does the Style 2 code run faster than the Style 1 code?
I would like to reduce the total time taken to complete this Pig job.
Does the Style 2 code achieve that, or is there no impact on the total time taken?
If somebody can confirm this, I can run similar code on my cluster with a very large dataset.
Upvotes: 0
Views: 97
Reputation: 1691
Both solutions will have the same cost.
However, if records_grp is not used elsewhere, version 2 lets you avoid declaring an extra variable, so your script is shorter.
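If you want to check this on your own setup before running on the full dataset, one option is Pig's EXPLAIN operator, which prints the plans Pig builds for an alias instead of running the job. The sketch below is just Style 1 from the question with dump swapped for explain; I have not run it against your data, so treat it as an illustration:

```pig
-- Style 1 from the question, with EXPLAIN in place of DUMP.
-- EXPLAIN prints the logical, physical and MapReduce plans for the alias
-- rather than executing the job.
records = load 'user/inputfiles/records.txt' using PigStorage(',') as (id:int, name:chararray, city:chararray);
records_grp = group records by city;
records_each = foreach records_grp generate group as city, COUNT(records.id) as emp_cnt;
explain records_each;
```

Do the same for Style 2 (replace the group and foreach lines with the single inlined foreach) and compare the two outputs. If the cost is the same, you should see the same operators and the same number of MapReduce jobs in both, possibly with different internal alias names.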
Upvotes: 1