SuperDelta
SuperDelta

Reputation: 253

In Hive, which query is better and why?

Assume there are two queries:

  1. select count(distinct a) from x;

  2. select count(*) from (select distinct a from x) y;

I know they return the same results, but from the perspective of Hive (using MapReduce). Can anyone please explain which one is the better choice and why?

Any help is appreciated.

Upvotes: 1

Views: 609

Answers (2)

Lakshman Battini
Lakshman Battini

Reputation: 1912

2nd Query: select count(*) from (select distinct a from x) y;

is upto 3.x faster than

1st Query: select count(distinct a) from x;

Please refer to https://issues.apache.org/jira/browse/HIVE-10568

Executed both the queries in Hive, first query executed in 1 stage with 1 reducer.

MapReduce Jobs Launched: Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 46.51 sec HDFS Read: 42857 HDFS Write: 4 SUCCESS Total MapReduce CPU Time Spent: 46 seconds 510 msec

Second query executed in 2 stages, with improved parallelism.

MapReduce Jobs Launched: Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 13.93 sec HDFS Read: 42857 HDFS Write: 115 SUCCESS Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 5.83 sec HDFS Read: 510 HDFS Write: 4 SUCCESS Total MapReduce CPU Time Spent: 19 seconds 760 msec

Upvotes: 2

leftjoin
leftjoin

Reputation: 38335

In Hive versions prior 1.2.0 the first query executes using one Map and one Reduce stages. Map sends each value to the single reducer, and reducer does all the job. count(distinct)

Single reducer processing too much data in this case.

During second query execution, mappers output distributed between many reducers, each reducer generates it's distinct list and final map-reduce job does summarize the size of each list. count(*) from (select distinct)

Since Hive 1.2.0 Hive 1.2.0+ provides auto-rewrite optimization hive.optimize.distinct.rewrite=true/false, see HIVE-10568

See also Hortonworks community

Upvotes: 2

Related Questions