Reputation: 1228
I'm trying to create a dataset where I aggregate a column based on the values of another column. To give an example
id | score | id2 | offensive
---------+-------+-----+-----------
a1 | 1.2 | 1 | false
a2 | 36.0 | 1 | true
a3 | 1.2 | 1 | true
a4 | 36.0 | 1 | false
What I want is ultimately create an array of two lists to group by the offensive column and the id's relating to it and for them to be sorted by the score value. Something like the following:
id |id2 | clean_group | offensive_group
---------+-----+-------------+-----------------
a1 | 1 | [a2, a3] | [a4, a1]
a2 | 1 | [a2, a3] | [a4, a1]
a3 | 1 | [a2, a3] | [a4, a1]
a4 | 1 | [a2, a3] | [a4, a1]
I know this is redundant data but that is the requirement, and it should be noted id's will all be distinct and unique, and the id2 will all be the same. I've been looking at array_agg
function in presto and collect_set
in hive to try to accomplish this but having lots of trouble.
Can anyone help? Are we able to pass conditions in the aggregate somehow where we say array_agg(id where offensive=true)
etc.
Upvotes: 1
Views: 9905
Reputation: 1549
Considering offensive column is a String, I think this should help you
SELECT
id,
id2,
CASE
WHEN offensive = 'true' then concat_ws(',',collect_set(id))
END AS clean_group,
CASE
WHEN offensive = 'false' then concat_ws(',',collect_set(id))
END AS offensive_group
FROM
table_name;
Upvotes: 3
Reputation: 2858
I think you want the aggregation with filter syntax: array_agg(id) filter (where offensive)
. This will only process the rows that match the filter.
Upvotes: 6