Ace Haidrey
Ace Haidrey

Reputation: 1228

Aggregating a column based on column values in Presto/Hive

I'm trying to create a dataset where I aggregate a column based on the values of another column. To give an example

 id      | score | id2 | offensive 
---------+-------+-----+-----------
 a1      | 1.2   | 1   | false     
 a2      | 36.0  | 1   | true      
 a3      | 1.2   | 1   | true      
 a4      | 36.0  | 1   | false  

What I want is ultimately create an array of two lists to group by the offensive column and the id's relating to it and for them to be sorted by the score value. Something like the following:

 id      |id2  | clean_group | offensive_group
---------+-----+-------------+-----------------
 a1      | 1   | [a2, a3]    | [a4, a1]
 a2      | 1   | [a2, a3]    | [a4, a1]
 a3      | 1   | [a2, a3]    | [a4, a1]
 a4      | 1   | [a2, a3]    | [a4, a1]

I know this is redundant data but that is the requirement, and it should be noted id's will all be distinct and unique, and the id2 will all be the same. I've been looking at array_agg function in presto and collect_set in hive to try to accomplish this but having lots of trouble.

Can anyone help? Are we able to pass conditions in the aggregate somehow where we say array_agg(id where offensive=true) etc.

Upvotes: 1

Views: 9905

Answers (2)

Rohit Nimmala
Rohit Nimmala

Reputation: 1549

Considering offensive column is a String, I think this should help you

SELECT
 id,
 id2,
 CASE
  WHEN offensive = 'true' then concat_ws(',',collect_set(id))
 END AS clean_group,
 CASE
  WHEN offensive = 'false' then concat_ws(',',collect_set(id))
 END AS offensive_group
FROM
table_name; 

Upvotes: 3

Dain Sundstrom
Dain Sundstrom

Reputation: 2858

I think you want the aggregation with filter syntax: array_agg(id) filter (where offensive). This will only process the rows that match the filter.

Upvotes: 6

Related Questions