foglerit
foglerit

Reputation: 8279

Hive: Aggregate values by attribute into a JSON or MAP field

I have a table that looks like this:

|   user | attribute   |   value |
|--------|-------------|---------|
|      1 | A           |      10 |
|      1 | A           |      20 |
|      1 | B           |       5 |
|      2 | B           |      10 |
|      2 | B           |      15 |
|      2 | C           |     100 |
|      2 | C           |     200 |

I'd like to group this table by user and collect the sum of the value field into a JSON or a MAP with attributes as keys, like:

| user | sum_values_by_attribute  |
|------|--------------------------|
|    1 | {"A": 30, "B": 15}       |
|    2 | {"B": 25, "C": 300}      |

Is there a way to do that in Hive?

I've found related questions such as this and this but none consider the case of a summation over values.

Upvotes: 1

Views: 1456

Answers (2)

leftjoin
leftjoin

Reputation: 38335

JSON string corresponding to map<string, int> can be built in Hive using native functions only: aggregate by user, attribute, then concatenate pairs "key": value and aggregate array of them, concatenate array using concat_ws, add curly braces.

Demo:

with initial_data as (
select stack(7,
1,'A',40,
1,'A',20,
1,'B',5,
2,'B',10,
2,'B',15,
2,'C',100,
2,'C',200) as (`user`, attribute, value )
)

select `user`, concat('{',concat_ws(',',collect_set(concat('"', attribute, '": ',sum_value))), '}') as sum_values_by_attribute  
from
(--aggregate groupby user, attribute
  select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;

Result ( JSON string ):

user    sum_values_by_attribute
1       {"A": 60,"B": 5}
2       {"B": 25,"C": 300}

Note: If you are running this on Spark, you can cast( as map<string, int>), Hive does not support casting complex types cast.

Also map<string, string> can be easily done using native functions only: the same array of key-values pairs byt without double-quotes (like A:10) concatenate to comma delimited string using concat_ws and convert to map using str_to_map function (the same WITH CTE is skipped):

select `user`, str_to_map(concat_ws(',',collect_set(concat(attribute, ':',sum_value)))) as sum_values_by_attribute  
from
(--aggregate groupby user, attribute
  select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;

Result ( map<string, string> ):

user    sum_values_by_attribute
1       {"A":"60","B":"5"}
2       {"B":"25","C":"300"}

And if you need map<string, int>, unfortunately, it can not be done using Hive native functions only because map_to_str returns map<string, string>, not map<string, int>. You can try brickhouse collect function:

add jar '~/brickhouse/target/brickhouse-0.6.0.jar'; --check brickhouse site https://github.com/klout/brickhouse for instructions

create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select `user`, collect(attribute, sum_value) as sum_values_by_attribute  
from
(--aggregate groupby user, attribute
  select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;

Upvotes: 1

Koushik Roy
Koushik Roy

Reputation: 7407

You can first calculate the sum by attribute and user_id and then use collect list. Pls let me know if below output is fine. enter image description here

SQL Below -

select `user`,
collect_list(concat(att,":",cast(val as string))) sum_values_by_attribute  
from 
(select `user`,`attribute` att, sum(`value`) val from tmp2 group by u,att) tmp2
group by `user`;

Testing Query -

create table tmp2 ( `user` int, `attribute` string, `value` int);

insert into tmp2 select 1,'A',40;
insert into tmp2 select 1,'A',20;
insert into tmp2 select 1,'B',5;
insert into tmp2 select 2,'C',20;
insert into tmp2 select 1,'B',10;
insert into tmp2 select 2,'B',10;
insert into tmp2 select 2,'C',10;

select `user`,
collect_list(concat(att,":",cast(val as string))) sum_values_by_attribute  
from 
(select `user`,`attribute` att, sum(`value`) val from tmp2 group by u,att) tmp2
group by `user`;

Upvotes: 1

Related Questions