Reputation: 533
I am trying to combine multiple column values into one column, grouped by a key. I will then use this to create nested JSON with the Spark 1.6 DataFrame API.
Sample input table abc:
a b c d e f g
---------------------------------------------
aa bb cc dd ee ff gg
aa bb cc1 dd1 ee1 ff1 gg1
aa bb cc2 dd2 ee2 ff2 gg2
aa1 bb1 cc3 dd3 ee3 ff3 gg3
aa1 bb1 cc4 dd4 ee4 ff4 gg4
Final output, grouped by a,b:
aa bb {{cc,dd,ee,ff,gg},{cc1,dd1,ee1,ff1,gg1},{cc2,dd2,ee2,ff2,gg2}}
aa1 bb1 {{cc3,dd3,ee3,ff3,gg3},{cc4,dd4,ee4,ff4,gg4}}
I tried using collect_list, but it can only collect a single column, and I don't know how to group multiple columns together. I also tried concatenating the columns into a string and collecting that, but then I lose the schema mapping, which I need because the result must ultimately be dumped as JSON. Combining the columns into a map or struct would also work. Please suggest an elegant approach to this problem. Thanks.

Note: using Spark 1.6.
Upvotes: 4
Views: 2165
Reputation: 44921
Both of the queries below work via sqlContext.sql("select ...");
select a,b
,collect_list(array(c,d,e,f,g))
from abc
group by a,b
;
+-----+-----+----------------------------------------------------------------------------------------------+
| aa | bb | [["cc","dd","ee","ff","gg"],["cc1","dd1","ee1","ff1","gg1"],["cc2","dd2","ee2","ff2","gg2"]] |
+-----+-----+----------------------------------------------------------------------------------------------+
| aa1 | bb1 | [["cc3","dd3","ee3","ff3","gg3"],["cc4","dd4","ee4","ff4","gg4"]] |
+-----+-----+----------------------------------------------------------------------------------------------+
select a,b
,collect_list(struct(c,d,e,f,g))
from abc
group by a,b
;
+-----+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| aa | bb | [{"col1":"cc","col2":"dd","col3":"ee","col4":"ff","col5":"gg"},{"col1":"cc1","col2":"dd1","col3":"ee1","col4":"ff1","col5":"gg1"},{"col1":"cc2","col2":"dd2","col3":"ee2","col4":"ff2","col5":"gg2"}] |
+-----+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| aa1 | bb1 | [{"col1":"cc3","col2":"dd3","col3":"ee3","col4":"ff3","col5":"gg3"},{"col1":"cc4","col2":"dd4","col3":"ee4","col4":"ff4","col5":"gg4"}] |
+-----+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
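Since the question asks for the DataFrame API, the struct variant can also be sketched directly against the DataFrame in Spark 1.6 (an untested sketch; note that in 1.6 `collect_list` is only available through a `HiveContext`, and the alias `vals` is my own):

```scala
import org.apache.spark.sql.functions.{collect_list, struct}

// Group by a,b and collect the remaining columns as an array of structs,
// preserving the field names c..g in the schema.
val df = sqlContext.table("abc")
val nested = df.groupBy("a", "b")
  .agg(collect_list(struct("c", "d", "e", "f", "g")).as("vals"))
```

Using structs rather than arrays keeps the per-column names in the schema, which matters for the final JSON output.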
[cloudera@quickstart ~]$ spark-shell --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Type --help for more information.
[cloudera@quickstart ~]$
[cloudera@quickstart ~]$ spark-shell
scala> sqlContext.sql("select * from abc").show;
+---+---+---+---+---+---+---+
| a| b| c| d| e| f| g|
+---+---+---+---+---+---+---+
| aa| bb| cc| dd| ee| ff| gg|
| aa| bb|cc1|dd1|ee1|ff1|gg1|
| aa| bb|cc2|dd2|ee2|ff2|gg2|
|aa1|bb1|cc3|dd3|ee3|ff3|gg3|
|aa1|bb1|cc4|dd4|ee4|ff4|gg4|
+---+---+---+---+---+---+---+
scala> sqlContext.sql("select a,b,collect_list(array(c,d,e,f,g)) from abc group by a,b").show;
+---+---+--------------------+
| a| b| _c2|
+---+---+--------------------+
|aa1|bb1|[[cc3, dd3, ee3, ...|
| aa| bb|[[cc, dd, ee, ff,...|
+---+---+--------------------+
scala> sqlContext.sql("select a,b,collect_list(struct(c,d,e,f,g)) from abc group by a,b").show;
+---+---+--------------------+
| a| b| _c2|
+---+---+--------------------+
|aa1|bb1|[[cc3,dd3,ee3,ff3...|
| aa| bb|[[cc,dd,ee,ff,gg]...|
+---+---+--------------------+
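Since the end goal is nested JSON, the result of the struct query can then be serialized with `toJSON` (a sketch; in Spark 1.6 `toJSON` returns an `RDD[String]`, and the alias `vals` is my own):

```scala
// Serialize each grouped row to a JSON string; the struct field
// names (c,d,e,f,g) are preserved as keys in the nested objects.
val json = sqlContext
  .sql("select a, b, collect_list(struct(c,d,e,f,g)) as vals from abc group by a, b")
  .toJSON
json.collect().foreach(println)
```

Each output line should look roughly like `{"a":"aa1","b":"bb1","vals":[{"c":"cc3","d":"dd3","e":"ee3","f":"ff3","g":"gg3"}, ...]}`.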
Upvotes: 4