Reputation: 71
I want to convert each value of a Spark Dataset (say 'x' rows and 'y' columns) into individual rows (the result should be x*y rows) with an additional column.
For example,
ColA ColB ColC
1 2 3
4 5 6
Should produce,
NewColA NewColB
1 ColA
4 ColA
2 ColB
5 ColB
3 ColC
6 ColC
Each value in NewColB is the name of the column that the value in NewColA came from, i.e. the values 1 and 4 have ColA in NewColB because they originally came from ColA, and so on.
I have seen a few implementations of the explode() function in Java, but I want to know how it can be used in my use case. Also note that the input may be large (x*y may be in the millions).
Upvotes: 1
Views: 453
Reputation: 2495
The simplest way to accomplish this is with the stack() function built into Spark SQL. Its first argument is the number of rows to generate per input row; the remaining arguments are consumed in pairs, each pair becoming one (NewColA, NewColB) output row.
import org.apache.spark.sql.functions.expr
import spark.implicits._ // spark is the SparkSession (predefined in spark-shell); needed for toDF

val df = Seq((1, 2, 3), (4, 5, 6)).toDF("ColA", "ColB", "ColC")
df.show()
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
| 1| 2| 3|
| 4| 5| 6|
+----+----+----+
val df2 = df.select(expr("stack(3, ColA, 'ColA', ColB, 'ColB', ColC, 'ColC') as (NewColA, NewColB)"))
df2.show()
+-------+-------+
|NewColA|NewColB|
+-------+-------+
| 1| ColA|
| 2| ColB|
| 3| ColC|
| 4| ColA|
| 5| ColB|
| 6| ColC|
+-------+-------+
Sorry, the examples are in Scala, but they should be easy to translate.
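For reference, a Java translation of the same stack() call might look like the following. It assumes a Dataset&lt;Row&gt; named df like the one above; the expression string is identical, only the API wrapper changes.
import static org.apache.spark.sql.functions.expr;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Same stack() expression as the Scala version; df is assumed to be a
// Dataset<Row> with the columns ColA, ColB, ColC.
Dataset<Row> df2 = df.select(expr(
    "stack(3, ColA, 'ColA', ColB, 'ColB', ColC, 'ColC') as (NewColA, NewColB)"));
df2.show();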
It's also possible, albeit more complicated and less efficient, to do this with a .flatMap().
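As a rough sketch of that alternative (in Java, under the assumptions that all columns are integers and df is the same Dataset&lt;Row&gt; as above), it would look something like this. Encoders.row is available from Spark 3.5; on older versions, RowEncoder.apply from org.apache.spark.sql.catalyst.encoders serves the same purpose.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Output schema: one (value, source-column-name) pair per row.
StructType schema = new StructType()
    .add("NewColA", DataTypes.IntegerType)
    .add("NewColB", DataTypes.StringType);

// Capture the column names once; the array is serialized with the lambda.
String[] cols = df.columns();

Dataset<Row> df2 = df.flatMap((FlatMapFunction<Row, Row>) row -> {
    List<Row> out = new ArrayList<>(cols.length);
    for (int i = 0; i < cols.length; i++) {
        // Emit (cell value, name of the column it came from).
        out.add(RowFactory.create(row.getInt(i), cols[i]));
    }
    return out.iterator();
}, Encoders.row(schema)); // Spark 3.5+; older versions: RowEncoder.apply(schema)
Because flatMap deserializes every Row into JVM objects, the stack() expression above is usually the better choice at the x*y-in-the-millions scale mentioned in the question.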
Upvotes: 1