StrangerThinks
StrangerThinks

Reputation: 248

Hive multiple distinct on query running slow?

Why is the below query taking forever to complete whereas if queried separately returns within seconds?

select count(*),count(distinct a), count(distinct b) from test.tablename;

However, if queried either of the below two queries it returns in seconds. e.g.

select count(*),count(distinct a) from test.tablename; 

or

select count(*),count(distinct b) from test.tablename;

anything to speed up? Already ran a analyze for compute stats. hive configurations are mostly optimized.

UPDATE: apologies. missed the count in the question.

Upvotes: 2

Views: 583

Answers (2)

leftjoin
leftjoin

Reputation: 38290

Alternative approach if you do not have too big counts (too big arrays will cause OOM). size(collect_set()) will give you the distinct count.

select count(*), size(collect_set(a)), size(collect_set(b)) from test.tablename;

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269503

I recall that Hive does a poor job implementing count(distinct). Is it faster like this?

select (select count(*) from test.tablename),
       (select count(*) from (select a from test.tablename group by a) x),
       (select count(*) from (select b from test.tablename group by b) x)

If you have relatively few values of a and b, then this might also have decent performance:

select sum(cnt),
       sum(case when seqnum_a = 1 then 1 else 0 end),
       sum(case when seqnum_b = 1 then 1 else 0 end)
from (select a, b, count(*) as cnt,
             row_number() over (partition by a order by a) as seqnum_a,
             row_number() over (partition by b order by b) as seqnum_b
      from test.tablename
      group by a, b
     ) ab;

And your question is really "why". Hive is a parallel database but some operations are performed on a single processor. Once upon a time, count(distinct) was one of those operations. However, that might have been fixed . . . if there is only one count(distinct) in the query. Multiple such expressions might require sending all the data to a single node to be processed there -- a real performance killer.

Upvotes: 0

Related Questions