Reputation: 1030
I am going through the Hive manual and am confused by how the documentation explains SORT BY: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
First it says
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer.
Then it says
Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.
If it already sorts the records before sending them to a reducer, then how is the final output not guaranteed to be sorted? Is it running a dual sort?
Upvotes: 1
Views: 538
Reputation: 553
Most of the logic for sort by and order by is quite similar. You can think of order by as a more restricted case of sort by. Let's suppose the underlying execution engine is MapReduce.
Both cases rely on the shuffle phase of MR to sort items. The shuffle operation can be broken into two parts, handled by the map side and the reduce side of an MR job respectively. The map side does the local sorting, and the reduce side merges the partial results that come from different mappers.
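To make the two halves of the shuffle concrete, here is a minimal Python sketch. It is not Hive or Hadoop code: the function names `map_side_sort` and `reduce_side_merge` and the sample rows are made up for illustration, and real MapReduce also spills to disk and partitions by key, which is left out here.

```python
import heapq

def map_side_sort(rows):
    """Local sort done on one mapper's output (hypothetical helper)."""
    return sorted(rows)

def reduce_side_merge(sorted_runs):
    """Merge runs that are each already sorted into one sorted list.

    This only merges; it relies on every input run being sorted already.
    """
    return list(heapq.merge(*sorted_runs))

# Three mappers, each holding an unsorted slice of the data (made-up values).
mapper_outputs = [[7, 2, 9], [4, 8, 1], [6, 3, 5]]

runs = [map_side_sort(rows) for rows in mapper_outputs]
merged = reduce_side_merge(runs)
print(runs)
print(merged)
```

The point of the split is that the reducer never has to sort from scratch. It only merges runs that each mapper has already sorted locally.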
When the doc says:
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer.
It means that rows are sorted by the map-side operations of the shuffle phase. Actually, this is true for both sort by and order by.
Then what's the difference between the two? It's the parallelism of reducers.
For order by, in order to get a globally ordered result set, Hive forces the number of reducers to be 1, so all data is sent to a single reducer. At this single reducer, the merging part of the shuffle guarantees that all data is sorted globally.
For sort by, there is no such enforcement, so the number of reducers can be anything. Each reducer receives only a subset of the rows, so the data is sorted within each reducer's output, but no global ordering is guaranteed across reducers. When the number of reducers is set to 1, explicitly or implicitly, sort by and order by behave the same.
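The effect of the reducer count can be simulated in a few lines of Python. This is only a sketch under stated assumptions: `run_job` is a hypothetical helper, and the modulo partitioner stands in for however rows actually get spread across reducers, which is not Hive's real rule.

```python
import heapq

def run_job(mapper_outputs, num_reducers):
    """Simulate the shuffle and return one sorted list per reducer."""
    # reducer_inputs[r] collects one sorted run per mapper for reducer r.
    reducer_inputs = [[] for _ in range(num_reducers)]
    for rows in mapper_outputs:
        buckets = [[] for _ in range(num_reducers)]
        for row in rows:
            # Assumed partitioner, for illustration only.
            buckets[row % num_reducers].append(row)
        for r, bucket in enumerate(buckets):
            reducer_inputs[r].append(sorted(bucket))  # map-side local sort
    # Reduce side: merge the already-sorted runs from every mapper.
    return [list(heapq.merge(*runs)) for runs in reducer_inputs]

mapper_outputs = [[7, 2, 9], [4, 8, 1], [6, 3, 5]]

sort_by_output = run_job(mapper_outputs, num_reducers=2)   # like sort by
order_by_output = run_job(mapper_outputs, num_reducers=1)  # like order by

print(sort_by_output)
print(order_by_output)
```

With two reducers, each output list is sorted on its own, but concatenating them is not globally sorted. With one reducer, the single merge sees every row, which gives the total order.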
Upvotes: 1