Dynamite

Reputation: 390

Grouping by values on a Spark Dataframe

I'm working on a Spark dataframe containing this kind of data:

A,1,2,3
B,1,2,3
C,1,2,3
D,4,2,3

I want to aggregate this data on the last three columns, so the output would be:

ABC,1,2,3
D,4,2,3

How can I do this in Scala? (This is not a big dataframe, so performance is secondary here.)

Upvotes: 0

Views: 517

Answers (2)

abiratsis

Reputation: 7336

Alternatively, you can key each row by the last three columns (c2, c3, c4), concatenate the values for each key via reduceByKey, and finally format each row as needed through a last map. It should be something like the following:

val data = sc.parallelize(List(
  ("A", "1", "2", "3"),
  ("B", "1", "2", "3"),
  ("C", "1", "2", "3"),
  ("D", "4", "2", "3")))

val res = data
  .map { case (c1, c2, c3, c4) => ((c2, c3, c4), c1) } // key by the last three columns
  .reduceByKey(_ + _)                                  // concatenate the first-column values per key
  .map { case (key, c1) => c1 + "," + key.productIterator.mkString(",") }
  .collect
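Note that reduceByKey makes no guarantee about the order in which values within a group are combined, so the first field could come out as, say, BAC instead of ABC. If a stable ordering matters, here is a minimal sketch using groupByKey plus an explicit sort (assuming the letters should simply be sorted alphabetically):

// sketch: sort each group's values before concatenating,
// so the combined column is deterministic (always "ABC")
val resSorted = data
  .map { case (c1, c2, c3, c4) => ((c2, c3, c4), c1) }
  .groupByKey()
  .map { case (key, names) =>
    names.toSeq.sorted.mkString + "," + key.productIterator.mkString(",")
  }
  .collect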

Upvotes: 1

vindev

Reputation: 2280

As mentioned in the comments, you can first use groupBy on the last three columns, collect the first column into a list, and then flatten that list with concat_ws. Here is one way of doing it:

import org.apache.spark.sql.functions._ // for collect_list and concat_ws
import spark.implicits._                // for toDF and the $ column syntax

// create your original DF
val df = Seq(("A",1,2,3),("B",1,2,3),("C",1,2,3),("D",4,2,3)).toDF("col1","col2","col3","col4")
df.show

//output
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   A|   1|   2|   3|
|   B|   1|   2|   3|
|   C|   1|   2|   3|
|   D|   4|   2|   3|
+----+----+----+----+

// group by "col2", "col3", "col4", collect "col1" into a list,
// and then concatenate that list into a single string

df.groupBy("col2", "col3", "col4")
  .agg(collect_list("col1").as("col1"))
  // you can change the string separator via concat_ws's first argument
  .select(concat_ws("", $"col1") as "col1", $"col2", $"col3", $"col4")
  .show

//output
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   D|   4|   2|   3|
| ABC|   1|   2|   3|
+----+----+----+----+
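Note that collect_list does not guarantee element order either, so the concatenated value may not always come out as ABC. If determinism matters, one option (a sketch, assuming Spark 1.5+ where sort_array is available) is to sort the collected array before concatenating:

// sketch: sort the collected list so the concatenated
// column is deterministic (always "ABC")
df.groupBy("col2", "col3", "col4")
  .agg(concat_ws("", sort_array(collect_list("col1"))).as("col1"))
  .select($"col1", $"col2", $"col3", $"col4")
  .show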

Upvotes: 2
