Hive - add column with number of distinct values in groups

Question

Suppose I have the following data.

number  group   
1       a       
1       a       
3       a       
4       a       
4       a       
5       c       
6       b       
6       b       
6       b       
7       b       
8       b       
9       b       
10      b       
14      b       
15      b

I would like to group the data by group and add a further column which say how many distinct values of number each group has.

My desired output would look as follows:

number  group   dist_number
1       a       3
1       a       3
3       a       3
4       a       3
4       a       3
5       c       1
6       b       9
6       b       9
6       b       9
7       b       9
8       b       9
9       b       9
10      b       9
14      b       9
15      b       9

What I have tried is:

> select *, count(distinct number) over(partition by group) from numbers;
11      11

As one sees, this aggregates globally and calculates the number of distinct values independently from the group.

One thing I could do is to use group by as follows:

hive> select *, count(distinct number) from numbers group by group;

a       3
b       7
c       1

And then join over group

But I thought maybe there is a more easy solution to this, e.g., using the over(partition by group) method?

o-90 · Accepted Answer

You definitely want to use windowing functions here. I'm not exactly sure how you got 11 11 from the query your tried; I'm 99% sure if you try to count(distinct _) in Hive with an over/partition it will complain. To get around this you can use collect_set() to get an array of the distinct elements in the partition and then you can use size() to count the elements.

Query:

select *
  , size(num_arr) dist_num
from (
  select *
    , collect_set(num) over (partition by grp) num_arr
  from db.tbl ) x

Output:

4   a   [4,3,1] 3
4   a   [4,3,1] 3
3   a   [4,3,1] 3
1   a   [4,3,1] 3
1   a   [4,3,1] 3
15  b   [15,14,10,9,8,7,6]  7
14  b   [15,14,10,9,8,7,6]  7
10  b   [15,14,10,9,8,7,6]  7
9   b   [15,14,10,9,8,7,6]  7
8   b   [15,14,10,9,8,7,6]  7
7   b   [15,14,10,9,8,7,6]  7
6   b   [15,14,10,9,8,7,6]  7
6   b   [15,14,10,9,8,7,6]  7
6   b   [15,14,10,9,8,7,6]  7
5   c   [5] 1

I included in the arrays in the output so you could see what was happening, obviously you can discard them in your query. As as note, doing a self-join here is really a disaster with regards to performance (and it's more lines of code).

Hive - add column with number of distinct values in groups

Answers (2)

Related Questions