ihadanny

Reputation: 4483

How to sort (order by) big data with Hive efficiently?

I want to sort a big dataset efficiently (i.e. with a custom partitioner, as described in "How does the MapReduce sort algorithm work?"), but I want to do it with Hive.

However, the Hive manual states that "order by" is performed by a single reducer. This surprises me, since Pig does implement something similar to what the article describes (pig impl).

Am I missing something, or is Hive simply not the right hammer for this job?

Upvotes: 4

Views: 5093

Answers (3)

Thejas Nair

Reputation: 241

It is not possible to use multiple reducers for a total ordering in Hive. This has not been implemented yet; see https://issues.apache.org/jira/browse/HIVE-1402.

If you want efficient total ordering, it will be easier to use Pig than to write a custom MR job.

Upvotes: 1

David Gruzman

Reputation: 8088

I think that Hive is not the right tool for the job, at least for now. It is built to be used as an OLAP/reporting tool and is therefore not optimized to produce large result datasets, since most analytical queries produce relatively small result sets. As a result, it has good TOP N capability but not good total ordering.

In case you haven't encountered it before, I suggest looking into Hadoop's terasort example, which is specifically aimed at sorting a large dataset in the best possible way using MR. http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/examples/terasort/package-summary.html
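
To make the idea behind that kind of sort concrete, here is a minimal single-process sketch in Python of sample-based range partitioning, which, as far as I understand it, is the general technique terasort relies on. It is not Hadoop or Hive code: the function names (`choose_split_points`, `range_partition`, `total_order_sort`) and the `sample_size` and `seed` parameters are made up for illustration, and in-memory lists stand in for the reducers.

```python
import bisect
import random


def choose_split_points(keys, num_reducers, sample_size=1000, seed=0):
    """Pick up to num_reducers - 1 boundary keys from a random sample (hypothetical helper)."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # Evenly spaced positions in the sorted sample approximate quantiles of the data.
    return [sample[(i * len(sample)) // num_reducers] for i in range(1, num_reducers)]


def range_partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


def total_order_sort(keys, num_reducers=4):
    """Simulate a multi-reducer total-order sort on an in-memory list."""
    split_points = choose_split_points(keys, num_reducers)
    partitions = [[] for _ in range(num_reducers)]
    for key in keys:
        partitions[range_partition(key, split_points)].append(key)
    # Each simulated reducer sorts only its own key range.
    sorted_parts = [sorted(part) for part in partitions]
    # Reducer i holds no key larger than any key in reducer i + 1,
    # so concatenating the outputs in reducer order is globally sorted.
    return [key for part in sorted_parts for key in part]
```

Because the partitions cover disjoint, ordered key ranges, no final merge step is needed; the quality of the sample only affects how evenly the work is spread across reducers, not the correctness of the result.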

Upvotes: 4

Olaf

Reputation: 6289

Hive generates MapReduce job(s) for executing the queries. In your particular case the actual sorting is done by the Hadoop MapReduce framework before the data is fed into the reducer.
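
A small sketch may help show what that framework-level sort does and does not guarantee. The Python below is only a toy model, not Hadoop code: `shuffle` is a hypothetical function, and integer modulo stands in for a hash-style partitioner. Each simulated reducer receives its keys in sorted order, but with more than one reducer the concatenated outputs are not globally sorted, which is consistent with a total "order by" being run on a single reducer.

```python
def shuffle(keys, num_reducers):
    """Toy shuffle: assign keys to reducers by modulo, then sort each reducer's input."""
    partitions = [[] for _ in range(num_reducers)]
    for key in keys:
        # Assumption: modulo on the integer key stands in for hash partitioning.
        partitions[key % num_reducers].append(key)
    # The framework hands each reducer its keys in sorted order.
    return [sorted(part) for part in partitions]


keys = [7, 2, 9, 4, 1, 8]

one_reducer = shuffle(keys, 1)   # a single sorted run: a total order
two_reducers = shuffle(keys, 2)  # two sorted runs with overlapping key ranges

# Concatenating the two runs in reducer order does not give a sorted list.
flattened = [key for part in two_reducers for key in part]
```

Getting a total order out of several reducers therefore requires a partitioner that assigns contiguous key ranges, rather than hash buckets, to each reducer.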

Upvotes: 0
