How to order and limit after GROUP BY in Pig Latin without crashing the job

Question

Often we want the top or bottom record of a set that has first been grouped on certain keys and then ordered within each group.

A = FOREACH data GENERATE x, y, z;

B = DISTINCT A;
C = GROUP B BY (x, y) PARALLEL 11;
D = FOREACH C {
    ORDERD = ORDER B BY z DESC;
    FIRST_REC = LIMIT ORDERD 1;
    GENERATE FLATTEN(FIRST_REC) AS (x, y, z);
};

STORE D INTO 'xyz' USING PigStorage();

The FOREACH ... GENERATE above takes 'forever' to finish and is eventually killed after 12 hours or so. The MapReduce job responsible for it spawned, say, 3 maps and 4 reducers; then 1 reducer keeps processing for the entire day and is eventually killed with ERROR 6017, a file error.

Is there a way to solve this, or a better way of doing what I want to do?

glefait · Accepted Answer

What volume of data is involved? Are you sure that your datanode(s) are big enough to handle that amount of data?

If so, instead of an ORDER, I would go for a MAX. That way, only one tuple has to be kept in memory, and that is sufficient because group already contains all the other information needed:

D = FOREACH C GENERATE group, MAX(B.z);
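
For reference, here is a minimal sketch of how the whole script might look with MAX in place of the nested ORDER and LIMIT. The LOAD statement, the 'input_data' path and the column types are assumptions for illustration (the question does not show how data is loaded), and z is assumed to be numeric. The DISTINCT step is dropped here because duplicate rows cannot change the maximum, and FLATTEN(group) is used only to get back flat (x, y, z) rows like the original output.

```pig
-- Assumption: 'input_data' is a hypothetical path; adjust the schema to your data.
data = LOAD 'input_data' USING PigStorage() AS (x:chararray, y:chararray, z:long);

A = FOREACH data GENERATE x, y, z;

-- Group directly on the keys; no DISTINCT is needed to compute a maximum.
C = GROUP A BY (x, y) PARALLEL 11;

-- MAX is an algebraic function, so Pig should be able to use the combiner
-- and avoid sorting each group's full bag inside a single reducer.
D = FOREACH C GENERATE FLATTEN(group) AS (x, y), MAX(A.z) AS z;

STORE D INTO 'xyz' USING PigStorage();
```

For the bottom of each group, MIN(A.z) can be used the same way. If whole records with extra columns are needed rather than just z, the builtin TOP function may be worth a look.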
