Why select distinct partitioned column is very slow?

Question

I hava a table zhihu_answer_increment, it was partitioned by column ym. When I execute query select distinct(ym) from zhihu.zhihu_answer_increment;, it took over 1 min to finish. During the process, hive launched a map-reduce job. here is the log:

INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.59 sec   HDFS Read: 14969 HDFS Write: 106 HDFS EC Read: 0 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 3 seconds 590 msec
INFO  : Completed executing command(queryId=hive_20191015113300_a6f58bad-f35b-4243-890a-a0d9ba9a5210); Time taken: 95.048 seconds
INFO  : OK

In comparison, show partitions zhihu_answer_increment; return the result much faster (just few seconds). But I need to take select distinct(ym) from zhihu.zhihu_answer_increment as a sub-query.

So how can I optimise this query? and I don't understand why it launched a mapreduce job, as far as I'm concerned, only checking the partition directory is enough for this query. Or maybe my consideration is too simple.

Why select distinct partitioned column is very slow?

Answers (1)

Related Questions