Lou_Ds
Lou_Ds

Reputation: 551

Hive - Select unique rows based on some columns

I am trying to group rows that have the save value across two columns and have the result ranked/sorted based on a third column.

The result should contain all the other columns.

For the table:

with sample as (
 select 'A' as c1, 'B' as c2, '22:00' as c3, 'Da' as c4
 union all
select  'A' as c1, 'B' as c2, '23:00' as c3, 'Db' as c4
 union all
select  'A' as c1, 'B' as c2, '09:00' as c3, 'Dc' as c4 
  union all
select  'A' as c1, 'C' as c2, '22:00' as c3, 'Dd' as c4
  union all
select  'B' as c1, 'C' as c2, '09:00' as c3, 'De' as c4
)

Grouping or filtering by column c1 and c2 ranked by time on c3, the output would be:

row_number() over (partition by c1, c2 order by c3) as rnk

| c1, c2, c3,  c4, rnk|
----------------------- 
| A | B |09:00| Dc| 1 |
| A | B |22:00| Da| 2 |
| A | B |23:00| Db| 3 |
| A | C |22:00| Dd| 1 |
| B | C |09:00| De| 1 |

All the other columns like c4, c5.. should be kept but don't have any effect on the group criteria or rank.

A believe a window function with partition on c1 and c2 and order by c3 could work, but not sure if it's the best way in case of very large tables and the need to group by more columns.

The final output would be a UNIQUE row where rank is 1 (top). The columns should be exactly the same as the sample table (no rank).

Select * from tableX where rnk = 1 would do the work but keep colum 'rnk'. I want to avoid writing all the columns in the select, excluding the rnk.

| c1, c2, c3,  c4 |
------------------- 
| A | B |09:00| Dc|
| A | C |22:00| Dd|
| B | C |09:00| De|

*Edited, add final table

Upvotes: 1

Views: 2367

Answers (2)

David דודו Markovitz
David דודו Markovitz

Reputation: 44991

select  inline(array(rec))

from   (select  struct(*)   as rec

               ,row_number() over 
                (
                    partition by    c1,c2 
                    order by        c3
                ) as rn

        from    sample t
        ) t

where   rn = 1
;

+------+------+-------+------+
| col1 | col2 | col3  | col4 |
+------+------+-------+------+
| A    | B    | 09:00 | Dc   |
| A    | C    | 22:00 | Dd   |
| B    | C    | 09:00 | De   |
+------+------+-------+------+

P.s. Please note that the columns names were aliased, due to the use of struct

Upvotes: 4

Gordon Linoff
Gordon Linoff

Reputation: 1270873

I think you just want row_number():

select t.*,
       row_number() over (partition by c1, c2 order by c3) as rnk
from sample t;

The question seems to have changed since I answered it -- a rather rude thing to happen. If you want the top ranked column, then use a subquery:

select t.*
from (select t.*,
             row_number() over (partition by c1, c2 order by c3) as rnk
      from sample t
     ) t
where rnk = 1;

This returns one row for each combination of c1/c2 in the data. If you want all rows in the event of ties, then use rank() instead of row_number().

Upvotes: 0

Related Questions