Get first record from each groups without using ROW_NUMBER() or RANK()

Question

Due Hive software constraints, I don't have the luxury of using RANK(), PARTITION BY, and OVER().

What I want to do is retrieve the first row from each group.

The table schema is as follows:

# Field, Type, Null, Key, Default, Extra
'prod_id', 'int(11)', 'NO', 'PRI', NULL, ''
'brand', 'varchar(20)', 'YES', '', NULL, ''
'name', 'varchar(75)', 'YES', '', NULL, ''
'price', 'int(11)', 'YES', '', NULL, ''
'cost', 'int(11)', 'YES', '', NULL, ''
'shipping_wt', 'smallint(6)', 'YES', '', NULL, ''

My current solution via SQL is:

select brand,name,price from 
(select brand,name,price,row_number() 
over(partition by brand order by price desc) as rn 
from products) as sub 
where rn=1;

How do I do so without the aforementioned functions and keywords?

Gordon Linoff · Accepted Answer

You can use join and aggregation:

select p.*
from products p join
     (select p.brand, min(p.price) as min_price
      from products p
      group by p.brand
     ) pp
     on p.brand = pp.brand and p.price = pp.min_price;

This is actually equivalent to rank() rather than row_number(), but it is the same if there are no duplicate prices.

I strongly recommend using the window functions, though. Not only are they standard and an integral part of Hive, but they should be faster too.

Get first record from each groups without using ROW_NUMBER() or RANK()

Answers (1)

Related Questions