Hive - Select unique rows based on some columns

Question

I am trying to group rows that have the save value across two columns and have the result ranked/sorted based on a third column.

The result should contain all the other columns.

For the table:

with sample as (
 select 'A' as c1, 'B' as c2, '22:00' as c3, 'Da' as c4
 union all
select  'A' as c1, 'B' as c2, '23:00' as c3, 'Db' as c4
 union all
select  'A' as c1, 'B' as c2, '09:00' as c3, 'Dc' as c4 
  union all
select  'A' as c1, 'C' as c2, '22:00' as c3, 'Dd' as c4
  union all
select  'B' as c1, 'C' as c2, '09:00' as c3, 'De' as c4
)

Grouping or filtering by column c1 and c2 ranked by time on c3, the output would be:

row_number() over (partition by c1, c2 order by c3) as rnk

| c1, c2, c3,  c4, rnk|
----------------------- 
| A | B |09:00| Dc| 1 |
| A | B |22:00| Da| 2 |
| A | B |23:00| Db| 3 |
| A | C |22:00| Dd| 1 |
| B | C |09:00| De| 1 |

All the other columns like c4, c5.. should be kept but don't have any effect on the group criteria or rank.

A believe a window function with partition on c1 and c2 and order by c3 could work, but not sure if it's the best way in case of very large tables and the need to group by more columns.

The final output would be a UNIQUE row where rank is 1 (top). The columns should be exactly the same as the sample table (no rank).

Select * from tableX where rnk = 1 would do the work but keep colum 'rnk'. I want to avoid writing all the columns in the select, excluding the rnk.

| c1, c2, c3,  c4 |
------------------- 
| A | B |09:00| Dc|
| A | C |22:00| Dd|
| B | C |09:00| De|

*Edited, add final table

David דודו Markovitz · Accepted Answer

select  inline(array(rec))

from   (select  struct(*)   as rec

               ,row_number() over 
                (
                    partition by    c1,c2 
                    order by        c3
                ) as rn

        from    sample t
        ) t

where   rn = 1
;

+------+------+-------+------+
| col1 | col2 | col3  | col4 |
+------+------+-------+------+
| A    | B    | 09:00 | Dc   |
| A    | C    | 22:00 | Dd   |
| B    | C    | 09:00 | De   |
+------+------+-------+------+

P.s. Please note that the columns names were aliased, due to the use of struct

Hive - Select unique rows based on some columns

Answers (2)

Related Questions