Which operations produce sorted output?

Question

The JOIN and GROUP BY operations can be much faster if their inputs are sorted on the key.

They also naturally produce sorted output when the input is sorted.

The question is: does Pig guarantee that the output is sorted, or do I need to ORDER BY the aliases produced by group by ... using 'merge'?

reo katoa · Accepted Answer

Pig offers no guarantees of ordering except following an ORDER BY statement. Since Pig sits on top of Hadoop, it does not directly control how output is created, including its order.

During the shuffle phase, keys are partitioned across the reducers and then sorted by key on each reducer. The result is that if you examine the output of each reducer in turn (i.e., look at the output from reducer 0, then reducer 1, etc.), you will find each one ordered by the map key. In the case of a Pig GROUP BY, the map key is the field you are grouping by. So you will frequently find that the output is sorted the way you want.
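To make the per-reducer ordering concrete, here is a toy Python model of the shuffle. It is only a sketch: the function name `simulate_shuffle` and the modulo partitioner are assumptions made for illustration, not Hadoop's or Pig's actual code.

```python
def simulate_shuffle(keys, num_reducers):
    """Toy shuffle: partition keys across reducers, then sort within each reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for key in keys:
        # Assumption: a plain modulo stands in for the real hash partitioner.
        partitions[key % num_reducers].append(key)
    # Each reducer sees its own keys in sorted order.
    return [sorted(partition) for partition in partitions]


keys = [7, 2, 9, 4, 1, 8, 3, 6]
reducer_outputs = simulate_shuffle(keys, num_reducers=2)

# Reading reducer 0's output, then reducer 1's, as described above.
concatenated = [key for part in reducer_outputs for key in part]

print(reducer_outputs)
print(concatenated)
```

In this model every reducer's output is sorted on its own, but with more than one reducer the outputs read back to back need not form one globally sorted sequence.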

The rub is that Pig does not control the underlying map-reduce shuffle and sort phases, so the sort order can change underneath it without Pig knowing or caring. I don't know under what conditions the ordering varies (possibly with different versions of Hadoop), but you should not rely on it. In general I find the ordering to be lexicographic, which means a GROUP BY on an integer will not be sorted the way you expect: 10 comes before 9, for example. I have also seen output that is sorted first by length, and then lexicographically, which again is likely not what you want.
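The difference between these orderings is easy to see in a few lines of Python. This is just an illustration with made-up keys; it says nothing about which ordering a given Hadoop installation will actually produce.

```python
int_keys = [10, 9, 100, 2, 33]

# The order you probably expect for integers.
numeric = sorted(int_keys)

# Lexicographic order: compare the keys as strings, character by character.
lexicographic = sorted(int_keys, key=str)

word_keys = ["pear", "fig", "apple", "kiwi"]

# Plain lexicographic order for string keys.
words_lexicographic = sorted(word_keys)

# Sorted first by length, then lexicographically within each length.
words_length_then_lex = sorted(word_keys, key=lambda k: (len(k), k))

print(numeric)
print(lexicographic)
print(words_lexicographic)
print(words_length_then_lex)
```

If a downstream step needs a particular order, an explicit ORDER BY is the only thing the answer above says you can count on.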

If you find it works for you in your distribution, then more power to you: you can skip those two MR jobs. But your script may not be portable, and it may break if you change something about the Hadoop installation.
