Reputation: 1467
We are often interested in taking the top or bottom record of a set (after an ORDER BY) that has been grouped on certain keys before ordering.
A = FOREACH data GENERATE x, y, z;
B = DISTINCT A;
C = GROUP B BY (x,y) PARALLEL 11;
D = FOREACH C {
    ORDERD = ORDER B BY z DESC;
    FIRST_REC = LIMIT ORDERD 1;
    GENERATE FLATTEN(FIRST_REC) AS (x,y,z);
};
STORE D INTO 'xyz' USING PigStorage();
The FOREACH ... GENERATE above takes 'forever' to finish and is eventually killed after 12 hours or so. The MapReduce job responsible for it spawned, say, 3 maps and 4 reducers; then 1 reducer keeps processing for the entire day and is eventually killed with ERROR 6017, a file error.
Is there a way to solve this, or a better way of doing what I want to do?
Upvotes: 2
Views: 933
Reputation: 1691
What is the volume of data involved? Are you sure that your datanode(s) are big enough to handle that amount of data?
If so, instead of an ORDER, I would go for a MAX. That way, only one tuple has to be kept in memory, and that is sufficient because group already contains all the other needed information:
D = FOREACH C GENERATE group, MAX (B.z);
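To get the same flat (x, y, z) output as the original script, the whole pipeline could look roughly like the sketch below. This is only a sketch under some assumptions: the LOAD line, its input path and its schema are hypothetical (the question does not show how data is defined), and z is assumed to be numeric so that MAX applies. For the bottom record per group, MIN should work the same way.

```pig
-- Hypothetical input path and schema (assumptions, not from the question).
-- z is assumed to be numeric so that MAX can be applied to it.
data = LOAD 'input' USING PigStorage() AS (x:chararray, y:chararray, z:int);

A = FOREACH data GENERATE x, y, z;

-- DISTINCT is dropped here: duplicate (x, y, z) rows cannot change the maximum.
C = GROUP A BY (x, y) PARALLEL 11;

-- FLATTEN(group) unpacks the (x, y) key tuple, so D has the flat schema (x, y, z).
D = FOREACH C GENERATE FLATTEN(group) AS (x, y), MAX(A.z) AS z;

STORE D INTO 'xyz' USING PigStorage();
```

As far as I know, MAX is an algebraic function in Pig, so the combiner can usually do part of the work on the map side instead of shipping every tuple of a large group to a single reducer. If one (x, y) key holds most of the rows, that is likely what keeps your last reducer busy.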
Upvotes: 2