user2895589

Reputation: 1030

Clarification on SORT BY vs ORDER BY in Hive

I am going through the Hive manual below and am confused by the details explained in the documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

First it says

Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer.

Then it says

Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.

If it already sorts records before sending them to the reducer, then how is the final output not guaranteed to be sorted? Is it running a dual sort?

Upvotes: 1

Views: 538

Answers (1)

damientseng

Reputation: 553

Most of the logic for sort by and order by is quite similar. You can think of order by as a more restricted case of sort by. Let's suppose the underlying execution engine is MapReduce.

Both cases rely on the shuffle phase of MR to sort items. The shuffle can be broken into two parts, handled by the map side and the reduce side of an MR job respectively: the former does local sorting, and the latter merges the partial results that come from different mappers.
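To make the two halves of the shuffle concrete, here is a minimal Python sketch. It is not Hive or Hadoop code; the function names and the sample keys are made up for illustration, and it assumes a single reducer that receives every mapper's output.

```python
import heapq

def map_side_sort(rows):
    """Map side: each mapper sorts its own output locally."""
    return sorted(rows)

def reduce_side_merge(sorted_runs):
    """Reduce side: merge the already-sorted runs coming from several mappers."""
    return list(heapq.merge(*sorted_runs))

# Hypothetical output of three mappers (sort keys only).
mapper_outputs = [[5, 1, 9], [4, 8, 2], [7, 3, 6]]

runs = [map_side_sort(rows) for rows in mapper_outputs]
merged = reduce_side_merge(runs)

print(runs)    # each run is sorted, but only locally
print(merged)  # the merge yields one fully sorted sequence
```

The merge step never re-sorts from scratch; it only interleaves runs that are already sorted, which is why the map-side sort has to happen first.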

When the doc says:

Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer.

it means that rows are sorted by the map-side operations of the shuffle phase. This is actually true for both sort by and order by.

Then what's the difference between the two? It's the parallelism of the reducers. For order by, in order to get a globally ordered result set, Hive forces the number of reducers to be 1, so all data is sent to a single reducer. At this single reducer, the merging part of the shuffle guarantees that all data is sorted globally.

For sort by, there's no such enforcement, so the number of reducers can be anything. As a result, data is only sorted within each reducer, and no global ordering is guaranteed. When the number of reducers is set to 1, explicitly or implicitly, sort by and order by behave the same.
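A small Python simulation may help show the effect of the reducer count. It is only a sketch: `run_job` is a made-up helper, and partitioning by `key % num_reducers` is an assumption chosen to keep the example deterministic, not a description of how Hive actually distributes rows.

```python
import heapq

def run_job(mapper_outputs, num_reducers):
    """Simulate the shuffle: partition, sort on the map side, merge per reducer."""
    # reducer_inputs[r] collects one sorted run per mapper.
    reducer_inputs = [[] for _ in range(num_reducers)]
    for rows in mapper_outputs:
        partitions = [[] for _ in range(num_reducers)]
        for key in rows:
            # Assumed partitioning rule, for illustration only.
            partitions[key % num_reducers].append(key)
        for r, part in enumerate(partitions):
            reducer_inputs[r].append(sorted(part))  # map-side local sort
    # Each reducer merges the sorted runs it received.
    return [list(heapq.merge(*runs)) for runs in reducer_inputs]

mapper_outputs = [[5, 1, 9], [4, 8, 2], [7, 3, 6]]

# "sort by" with two reducers: each reducer's output is sorted on its own.
sort_by_output = run_job(mapper_outputs, num_reducers=2)
# "order by": a single reducer sees everything.
order_by_output = run_job(mapper_outputs, num_reducers=1)

# Reading the reducer outputs one after another gives no total order.
concatenated = [key for part in sort_by_output for key in part]

print(sort_by_output)   # [[2, 4, 6, 8], [1, 3, 5, 7, 9]]
print(concatenated)     # [2, 4, 6, 8, 1, 3, 5, 7, 9]
print(order_by_output)  # [[1, 2, 3, 4, 5, 6, 7, 8, 9]]
```

With two reducers every output is sorted internally, yet the combined result is only partially ordered. With one reducer the same pipeline produces a total order, which matches the order by behavior described above.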

Upvotes: 1