digdigdoot
digdigdoot

Reputation: 761

Get first record from each groups without using ROW_NUMBER() or RANK()

Due Hive software constraints, I don't have the luxury of using RANK(), PARTITION BY, and OVER().

What I want to do is retrieve the first row from each group.

The table schema is as follows:

# Field, Type, Null, Key, Default, Extra
'prod_id', 'int(11)', 'NO', 'PRI', NULL, ''
'brand', 'varchar(20)', 'YES', '', NULL, ''
'name', 'varchar(75)', 'YES', '', NULL, ''
'price', 'int(11)', 'YES', '', NULL, ''
'cost', 'int(11)', 'YES', '', NULL, ''
'shipping_wt', 'smallint(6)', 'YES', '', NULL, ''

My current solution via SQL is:

select brand,name,price from 
(select brand,name,price,row_number() 
over(partition by brand order by price desc) as rn 
from products) as sub 
where rn=1;

How do I do so without the aforementioned functions and keywords?

Upvotes: 1

Views: 885

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1271111

You can use join and aggregation:

select p.*
from products p join
     (select p.brand, min(p.price) as min_price
      from products p
      group by p.brand
     ) pp
     on p.brand = pp.brand and p.price = pp.min_price;

This is actually equivalent to rank() rather than row_number(), but it is the same if there are no duplicate prices.

I strongly recommend using the window functions, though. Not only are they standard and an integral part of Hive, but they should be faster too.

Upvotes: 2

Related Questions