Reputation: 248
Why is the below query taking forever to complete whereas if queried separately returns within seconds?
select count(*),count(distinct a), count(distinct b) from test.tablename;
However, if queried either of the below two queries it returns in seconds. e.g.
select count(*),count(distinct a) from test.tablename;
or
select count(*),count(distinct b) from test.tablename;
anything to speed up? Already ran a analyze for compute stats. hive configurations are mostly optimized.
UPDATE: apologies. missed the count in the question.
Upvotes: 2
Views: 583
Reputation: 38290
Alternative approach if you do not have too big counts (too big arrays will cause OOM). size(collect_set())
will give you the distinct count.
select count(*), size(collect_set(a)), size(collect_set(b)) from test.tablename;
Upvotes: 1
Reputation: 1269503
I recall that Hive does a poor job implementing count(distinct)
. Is it faster like this?
select (select count(*) from test.tablename),
(select count(*) from (select a from test.tablename group by a) x),
(select count(*) from (select b from test.tablename group by b) x)
If you have relatively few values of a and b, then this might also have decent performance:
select sum(cnt),
sum(case when seqnum_a = 1 then 1 else 0 end),
sum(case when seqnum_b = 1 then 1 else 0 end)
from (select a, b, count(*) as cnt,
row_number() over (partition by a order by a) as seqnum_a,
row_number() over (partition by b order by b) as seqnum_b
from test.tablename
group by a, b
) ab;
And your question is really "why". Hive is a parallel database but some operations are performed on a single processor. Once upon a time, count(distinct)
was one of those operations. However, that might have been fixed . . . if there is only one count(distinct)
in the query. Multiple such expressions might require sending all the data to a single node to be processed there -- a real performance killer.
Upvotes: 0