Tim Edwards
Tim Edwards

Reputation: 1028

Impala Analytical functions in where clause

So the basic premise of the question is that I have some huge tables in hadoop that I need to get some samples from each month. I've mocked up the below to show the sort of thing I'm after but obviously it's not real data...

--Create the table
CREATE TABLE exp_dqss_team.testranking (
  Name STRING,
  Age INT,
  Favourite_Cheese STRING
  ) STORED AS PARQUET;

--Put some data in
INSERT INTO TABLE exp_dqss_team.testranking
VALUES (
  ('Tim', 33, 'Cheddar'),
  ('Martin', 49, 'Gorgonzola'),
  ('Will', 39, 'Brie'),
  ('Bob', 63, 'Cheddar'),
  ('Bill', 35, 'Brie'),
  ('Ben', 42, 'Gorgonzola'),
  ('Duncan', 55, 'Brie'),
  ('Dudley', 28, 'Cheddar'),
  ('Edmund', 27, 'Brie'),
  ('Baldrick', 29, 'Gorgonzola'));

What I want to get is something like the youngest 2 people in each cheese category. The below gives me the age rankings for each cheese category but won't restrict it to the top two:

SELECT RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) AS rank_my_cheese, favourite_cheese, name, age
FROM exp_dqss_team.testranking;

If I add a WHERE clause it gives me the following error:

WHERE clause must not contain analytic expressions

SELECT RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) AS rank_my_cheese, favourite_cheese, name, age
FROM exp_dqss_team.testranking
WHERE RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) <3;

Is there a better way to do this than creating a table of all the rankings and then selecting from that with a WHERE clause on the ranking?

Upvotes: 1

Views: 2088

Answers (1)

Giorgos Altanis
Giorgos Altanis

Reputation: 2760

Could you try this?

select * from (
SELECT RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) AS rank_my_cheese, favourite_cheese, name, age
FROM exp_dqss_team.testranking
) as temp
where rank_my_cheese <= 2;

Upvotes: 2

Related Questions