In Hive, which query is better and why?

Question

Assume there are two queries:

select count(distinct a) from x;
select count(*) from (select distinct a from x) y;

I know they return the same results, but from the perspective of Hive (using MapReduce), can anyone please explain which one is the better choice and why?

Any help is appreciated.

leftjoin · Accepted Answer

In Hive versions prior to 1.2.0, the first query executes with one map stage and one reduce stage. The mappers send every value to a single reducer, and that reducer does all the work.

The single reducer has to process too much data in this case.

When the second query runs, the mapper output is distributed among many reducers. Each reducer produces its own distinct list, and a final MapReduce job sums the sizes of those lists.
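One way to check the difference on your own cluster is to compare the two plans with EXPLAIN. This is a minimal sketch, assuming the table x and column a from the question and the MapReduce execution engine. The exact plan text depends on the Hive version, and on versions that rewrite the first query automatically the two plans may look much alike.

```sql
-- Assumes table x with column a, as in the question.
-- Use the MapReduce engine so the plan shows map and reduce stages.
SET hive.execution.engine=mr;

-- Plan for the single-query form
EXPLAIN select count(distinct a) from x;

-- Plan for the subquery form
EXPLAIN select count(*) from (select distinct a from x) y;
```

In the output, look at the number of stages and at where the distinct aggregation happens in each plan.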

Since Hive 1.2.0, Hive provides an automatic rewrite optimization, controlled by hive.optimize.distinct.rewrite=true/false; see HIVE-10568.
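The property can be set per session, which makes it easy to compare both behaviours on the same data. A sketch, assuming Hive 1.2.0 or later and the table x from the question; as far as I know the property defaults to true, but check the configuration of your own installation.

```sql
-- Assumes Hive 1.2.0+ and table x with column a, as in the question.

-- Turn the rewrite off: count(distinct) runs as a single-stage aggregation.
SET hive.optimize.distinct.rewrite=false;
select count(distinct a) from x;

-- Turn the rewrite on: Hive may rewrite the same query to a multi-stage aggregation.
SET hive.optimize.distinct.rewrite=true;
select count(distinct a) from x;
```

Both runs should return the same count; only the execution plan and the runtime are expected to differ.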
