alex-arkhipov
alex-arkhipov

Reputation: 72

How to pick rows with higher value only in Hive

I have the following data in Hive (sorted by Count column):

ItemX   ItemXX  Count
-----   ------  -----
item1   item11  8
item1   item12  2
item1   item13  1
item2   item21  10
item2   item22  1
item3   item31  9
item3   item32  7
item3   item33  5

I need to filter max Count lines only. Like this:

item1   item11  8
item2   item21  10
item3   item31  9

How can I do this in Hive QL?

Upvotes: 1

Views: 60

Answers (2)

Iskuskov Alexander
Iskuskov Alexander

Reputation: 4375

In common cases (without sorting) you can use these queries:

-- 1.
SELECT a.* FROM tbl a 
INNER JOIN (
    SELECT ItemX, MAX(Count) Count 
    FROM tbl 
    GROUP BY ItemX
) b ON a.ItemX = b.ItemX AND a.Count = b.Count;

-- 2.
SELECT a.* FROM tbl a 
LEFT OUTER JOIN tbl b 
    ON a.ItemX = b.ItemX AND a.Count < b.Count 
WHERE b.ItemX  IS NULL;

Note: Second query can be execute with Hive 2.2.0+ because complex expressions in ON clause are not supported before this version (see Hive wiki).

Upvotes: 1

leftjoin
leftjoin

Reputation: 38290

Use row_number() function:

select ItemX, ItemXX, Count
  from
     (
      select ItemX, ItemXX, Count,
             row_number() over(partition by ItemX order by Count desc) rn 
      from table_name 
     )s
where rn=1

Upvotes: 3

Related Questions