Reputation: 9618
Suppose I have the following data.
number group
1 a
1 a
3 a
4 a
4 a
5 c
6 b
6 b
6 b
7 b
8 b
9 b
10 b
14 b
15 b
I would like to group the data by group
and add a further column which say how many distinct values of number
each group has.
My desired output would look as follows:
number group dist_number
1 a 3
1 a 3
3 a 3
4 a 3
4 a 3
5 c 1
6 b 9
6 b 9
6 b 9
7 b 9
8 b 9
9 b 9
10 b 9
14 b 9
15 b 9
What I have tried is:
> select *, count(distinct number) over(partition by group) from numbers;
11 11
As one sees, this aggregates globally and calculates the number of distinct values independently from the group.
One thing I could do is to use group by
as follows:
hive> select *, count(distinct number) from numbers group by group;
a 3
b 7
c 1
And then join over group
But I thought maybe there is a more easy solution to this, e.g., using the over(partition by group)
method?
Upvotes: 1
Views: 1768
Reputation: 379
As per your requirement,this may work:
select number,group1,COUNT(group1) OVER (PARTITION BY group1) as dist_num from table1;
Upvotes: 0
Reputation: 17585
You definitely want to use windowing functions here. I'm not exactly sure how you got 11 11
from the query your tried; I'm 99% sure if you try to count(distinct _)
in Hive with an over/partition
it will complain. To get around this you can use collect_set()
to get an array of the distinct elements in the partition and then you can use size()
to count the elements.
Query:
select *
, size(num_arr) dist_num
from (
select *
, collect_set(num) over (partition by grp) num_arr
from db.tbl ) x
Output:
4 a [4,3,1] 3
4 a [4,3,1] 3
3 a [4,3,1] 3
1 a [4,3,1] 3
1 a [4,3,1] 3
15 b [15,14,10,9,8,7,6] 7
14 b [15,14,10,9,8,7,6] 7
10 b [15,14,10,9,8,7,6] 7
9 b [15,14,10,9,8,7,6] 7
8 b [15,14,10,9,8,7,6] 7
7 b [15,14,10,9,8,7,6] 7
6 b [15,14,10,9,8,7,6] 7
6 b [15,14,10,9,8,7,6] 7
6 b [15,14,10,9,8,7,6] 7
5 c [5] 1
I included in the arrays in the output so you could see what was happening, obviously you can discard them in your query. As as note, doing a self-join here is really a disaster with regards to performance (and it's more lines of code).
Upvotes: 1