Reputation: 877
I'm trying to create a new DataFrame from an existing DataFrame in Scala/Spark.
Below is my existing DataFrame:
+----------+-----------------------+
| group| value|
+----------+-----------------------+
| 4|[blah blah blah blah...|
| 0|[blah blah blah blah...|
| 1|[blah blah blah blah...|
| 1|[blah blah blah blah...|
| 0|[blah blah blah blah...|
| 2|[blah blah blah blah...|
| 0|[blah blah blah blah...|
| 2|[blah blah blah blah...|
//and so on
+----------+-----------------------+
Now I want to group the above DataFrame by the group
column and, for each group, aggregate the value column
into a list of the original values, producing something like this:
+----------+---------------------------------+
| group| value|
+----------+---------------------------------+
| 0|[[blah blah...],[blah blah...]...|
| 1|[[blah blah...],[blah blah...]...|
| 2|[[blah blah...],[blah blah...]...|
| 3|[[blah blah...],[blah blah...]...|
| 4|[[blah blah...],[blah blah...]...|
+----------+---------------------------------+
How can I achieve this?
Upvotes: 1
Views: 596
Reputation: 10382
Try the code below: group by the group column and aggregate with collect_list.
scala> df.show(false)
+-----+---------------------+
|group|value |
+-----+---------------------+
|4 |[blah blah blah blah]|
|0 |[blah blah blah blah]|
|1 |[blah blah blah blah]|
|1 |[blah blah blah blah]|
|0 |[blah blah blah blah]|
|2 |[blah blah blah blah]|
|0 |[blah blah blah blah]|
|2 |[blah blah blah blah]|
+-----+---------------------+
scala>
df
.groupBy($"group")
.agg(collect_list($"value").as("value"))
.orderBy($"group".asc)
.show(false)
+-----+---------------------------------------------------------------------+
|group|value |
+-----+---------------------------------------------------------------------+
|0 |[[blah blah blah blah], [blah blah blah blah], [blah blah blah blah]]|
|1 |[[blah blah blah blah], [blah blah blah blah]] |
|2 |[[blah blah blah blah], [blah blah blah blah]] |
|4 |[[blah blah blah blah]] |
+-----+---------------------------------------------------------------------+
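The snippet above relies on the spark-shell, where the SparkSession, implicits, and SQL functions are already in scope. Outside the shell, the imports have to be spelled out. Below is a minimal self-contained sketch, assuming a local SparkSession and made-up sample rows standing in for your data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

object GroupToList {
  def main(args: Array[String]): Unit = {
    // Assumption: running locally for illustration
    val spark = SparkSession.builder()
      .appName("group-to-list")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the original DataFrame
    val df = Seq(
      (4, "blah blah blah blah"),
      (0, "blah blah blah blah"),
      (1, "blah blah blah blah"),
      (1, "blah blah blah blah"),
      (0, "blah blah blah blah"),
      (2, "blah blah blah blah")
    ).toDF("group", "value")

    // Group by "group" and collect each group's values into an array column
    val grouped = df
      .groupBy($"group")
      .agg(collect_list($"value").as("value"))
      .orderBy($"group".asc)

    grouped.show(false)

    spark.stop()
  }
}

Note that collect_list keeps duplicates and gives no ordering guarantee within each array; if you only need distinct values per group, collect_set can be used the same way.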
Upvotes: 2