Ashish Singh
Ashish Singh

Reputation: 533

Group by to club multiple column value in hive to one column

I am trying to club multiple column value to one column based on group by key. Basically I will use this to create nested JSON using spark 1.6 dataframe api.

Sample Input Table abc:-

a     b     c       d       e       f       g
---------------------------------------------
aa    bb    cc      dd      ee      ff      gg
aa    bb    cc1     dd1     ee1     ff1     gg1
aa    bb    cc2     dd2     ee2     ff2     gg2
aa1   bb1   cc3     dd3     ee3     ff3     gg3
aa1   bb1   cc4     dd4     ee4     ff4     gg4

final output group by a,b :-

aa      bb      {{cc,dd,ee,ff,gg},{cc1,dd1,ee1,ff1,gg1},{cc2,dd2,ee2,ff2,gg2}}
aa1     bb1     {{cc3,dd3,ee3,ff3,gg3},{cc4,dd4,ee4,ff4,gg4}}

I tried using collect_list but it can only group one column. Don't know how to group multiple column together. I tried using concat string and then use collect on it but I will loose schema mapping as I have to finally dump it in json format. Clubbing the colunms in form of map or struct will also work. Please suggest some elegant approach/solution to this problem. Thanks Note : Using Spark 1.6

Upvotes: 4

Views: 2165

Answers (1)

David דודו Markovitz
David דודו Markovitz

Reputation: 44921

Both queries work with sqlContext.sql ("select ...");


select      a,b
           ,collect_list(array(c,d,e,f,g))

from        abc

group by    a,b
;

+-----+-----+----------------------------------------------------------------------------------------------+
| aa  | bb  | [["cc","dd","ee","ff","gg"],["cc1","dd1","ee1","ff1","gg1"],["cc2","dd2","ee2","ff2","gg2"]] |
+-----+-----+----------------------------------------------------------------------------------------------+
| aa1 | bb1 | [["cc3","dd3","ee3","ff3","gg3"],["cc4","dd4","ee4","ff4","gg4"]]                            |
+-----+-----+----------------------------------------------------------------------------------------------+

select      a,b
           ,collect_list(struct(c,d,e,f,g))

from        abc

group by    a,b
;

+-----+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| aa  | bb  | [{"col1":"cc","col2":"dd","col3":"ee","col4":"ff","col5":"gg"},{"col1":"cc1","col2":"dd1","col3":"ee1","col4":"ff1","col5":"gg1"},{"col1":"cc2","col2":"dd2","col3":"ee2","col4":"ff2","col5":"gg2"}] |
+-----+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| aa1 | bb1 | [{"col1":"cc3","col2":"dd3","col3":"ee3","col4":"ff3","col5":"gg3"},{"col1":"cc4","col2":"dd4","col3":"ee4","col4":"ff4","col5":"gg4"}]                                                               |
+-----+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Spark demo

[cloudera@quickstart ~]$ spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Type --help for more information.
[cloudera@quickstart ~]$ 

[cloudera@quickstart ~]$ spark-shell

scala> sqlContext.sql("select * from abc").show;
+---+---+---+---+---+---+---+
|  a|  b|  c|  d|  e|  f|  g|
+---+---+---+---+---+---+---+
| aa| bb| cc| dd| ee| ff| gg|
| aa| bb|cc1|dd1|ee1|ff1|gg1|
| aa| bb|cc2|dd2|ee2|ff2|gg2|
|aa1|bb1|cc3|dd3|ee3|ff3|gg3|
|aa1|bb1|cc4|dd4|ee4|ff4|gg4|
+---+---+---+---+---+---+---+


scala> sqlContext.sql("select a,b,collect_list(array(c,d,e,f,g)) from abc group by a,b").show;
+---+---+--------------------+                                                  
|  a|  b|                 _c2|
+---+---+--------------------+
|aa1|bb1|[[cc3, dd3, ee3, ...|
| aa| bb|[[cc, dd, ee, ff,...|
+---+---+--------------------+


scala> sqlContext.sql("select a,b,collect_list(struct(c,d,e,f,g)) from abc group by a,b").show;
+---+---+--------------------+                                                  
|  a|  b|                 _c2|
+---+---+--------------------+
|aa1|bb1|[[cc3,dd3,ee3,ff3...|
| aa| bb|[[cc,dd,ee,ff,gg]...|
+---+---+--------------------+

Upvotes: 4

Related Questions