agustin
agustin

Reputation: 1351

Count distinct each column in Hive

Given following table:

--------------------------------------------------------------------------------------
| browser (col1)  | os (col2)     | device (col2)  |    ...   |     city (col650)    |       
--------------------------------------------------------------------------------------
| Chrome          | Android       | Samsung        |    ...   | Berlin               |
--------------------------------------------------------------------------------------
| Chrome          | Android       | Samsung        |    ...   | Cologne              |
--------------------------------------------------------------------------------------
| Mozilla         | Android       | Huawei         |    ...   | Munich               |
--------------------------------------------------------------------------------------
| Chrome          | Android       | Sony           |    ...   | Berlin               |
--------------------------------------------------------------------------------------

I would like to get the distinct values of each column:

--------------------------------------------------------------------------------------
| browser (col1)  | os (col2)     | device (col2)  |    ...   |     city (col650)    |       
--------------------------------------------------------------------------------------
| 2               | 1             | 3              |    ...   | 4                    |
--------------------------------------------------------------------------------------

The table has 650 different columns, therefore specifying each column in the query is not possible.

Upvotes: 0

Views: 45

Answers (1)

nobody
nobody

Reputation: 11080

You'll have to do this for all 650 columns.Sum all the row values with rank 1.

 select
         sum(case when col1Rank=1 then 1 ekse 0 end) as col1,
         sum(case when col2Rank=1 then 1 else 0 end) as col2,
         sum(case when col3Rank=1 then 1 else 0 end) as col3

from 
(
    select
         row_number() over(partition by col1 order by col1) as col1Rank,
         row_number() over(partition by col2 order by col2) as col2Rank,
         row_number() over(partition by col3 order by col3) as col3Rank     
    from table_name
) A;

Upvotes: 1

Related Questions