Reputation: 253
Assume there are two queries:
select count(distinct a) from x;
select count(*) from (select distinct a from x) y;
I know they return the same results, but from the perspective of Hive
(using MapReduce
). Can anyone please explain which one is the better choice and why?
Any help is appreciated.
Upvotes: 1
Views: 609
Reputation: 1912
2nd Query: select count(*) from (select distinct a from x) y;
is upto 3.x faster than
1st Query: select count(distinct a) from x;
Please refer to https://issues.apache.org/jira/browse/HIVE-10568
Executed both the queries in Hive, first query executed in 1 stage with 1 reducer.
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 46.51 sec HDFS Read: 42857 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 46 seconds 510 msec
Second query executed in 2 stages, with improved parallelism.
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 13.93 sec HDFS Read: 42857 HDFS Write: 115 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 5.83 sec HDFS Read: 510 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 19 seconds 760 msec
Upvotes: 2
Reputation: 38335
In Hive versions prior 1.2.0 the first query executes using one Map and one Reduce stages. Map sends each value to the single reducer, and reducer does all the job.
Single reducer processing too much data in this case.
During second query execution, mappers output distributed between many reducers, each reducer generates it's distinct list and final map-reduce job does summarize the size of each list.
Since Hive 1.2.0 Hive 1.2.0+ provides auto-rewrite optimization hive.optimize.distinct.rewrite=true/false
, see HIVE-10568
See also Hortonworks community
Upvotes: 2