Aggregating a column based on column values in Presto/Hive

Question

I'm trying to create a dataset where I aggregate a column based on the values of another column. To give an example

 id      | score | id2 | offensive 
---------+-------+-----+-----------
 a1      | 1.2   | 1   | false     
 a2      | 36.0  | 1   | true      
 a3      | 1.2   | 1   | true      
 a4      | 36.0  | 1   | false

What I want is ultimately create an array of two lists to group by the offensive column and the id's relating to it and for them to be sorted by the score value. Something like the following:

 id      |id2  | clean_group | offensive_group
---------+-----+-------------+-----------------
 a1      | 1   | [a2, a3]    | [a4, a1]
 a2      | 1   | [a2, a3]    | [a4, a1]
 a3      | 1   | [a2, a3]    | [a4, a1]
 a4      | 1   | [a2, a3]    | [a4, a1]

I know this is redundant data but that is the requirement, and it should be noted id's will all be distinct and unique, and the id2 will all be the same. I've been looking at array_agg function in presto and collect_set in hive to try to accomplish this but having lots of trouble.

Can anyone help? Are we able to pass conditions in the aggregate somehow where we say array_agg(id where offensive=true) etc.

Dain Sundstrom · Accepted Answer

I think you want the aggregation with filter syntax: array_agg(id) filter (where offensive). This will only process the rows that match the filter.

Aggregating a column based on column values in Presto/Hive

Answers (2)

Related Questions