Rohit K

Reputation: 71

Convert each value of a Java Spark Dataset into a row using explode()

I want to convert each value of a Spark Dataset (say x rows and y columns) into individual rows (the result should have x*y rows) with an additional column.

For example,

ColA    ColB    ColC

 1        2       3
 4        5       6

Should produce,

NewColA NewColB

   1      ColA
   4      ColA
   2      ColB
   5      ColB
   3      ColC
   6      ColC    

Each value in NewColB is the name of the column that the corresponding NewColA value came from, i.e. values 1 and 4 have ColA in NewColB because they originally came from ColA, and so on.

I have seen a few implementations of the explode() function in Java, but I want to know how it can be used in my use case. Also note that the input may be large (x*y may be in the millions).

Upvotes: 1

Views: 453

Answers (1)

Travis Hegner

Reputation: 2495

The simplest way to accomplish this is with the stack() function built into Spark SQL.

import spark.implicits._  // for .toDF on a Seq (already in scope in spark-shell)

val df = Seq((1, 2, 3), (4, 5, 6)).toDF("ColA", "ColB", "ColC")
df.show()

+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
|   1|   2|   3|
|   4|   5|   6|
+----+----+----+

import org.apache.spark.sql.functions.expr

val df2 = df.select(expr("stack(3, ColA, 'ColA', ColB, 'ColB', ColC, 'ColC') as (NewColA, NewColB)"))
df2.show()

+-------+-------+
|NewColA|NewColB|
+-------+-------+
|      1|   ColA|
|      2|   ColB|
|      3|   ColC|
|      4|   ColA|
|      5|   ColB|
|      6|   ColC|
+-------+-------+

Sorry, the examples are in Scala, but they should be easy to translate.
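
For reference, a rough Java translation of the stack() call via Dataset.selectExpr(). This is a sketch, not code from the original answer; it assumes df is a Dataset<Row> holding the same ColA/ColB/ColC columns as above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Same stack() expression as the Scala version above; selectExpr()
// parses the string with the same SQL parser as expr().
Dataset<Row> df2 = df.selectExpr(
        "stack(3, ColA, 'ColA', ColB, 'ColB', ColC, 'ColC') as (NewColA, NewColB)");
df2.show();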

It's also possible, albeit more complicated and less efficient, to do this with a .flatMap().
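
As a sketch of what that flatMap() route could look like in Java (assumptions, not from the original answer: all columns are integers, and the Spark 2.x-era RowEncoder.apply() API is available; the schema below is illustrative):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Output schema: one (value, source column name) pair per input cell.
StructType schema = new StructType()
        .add("NewColA", DataTypes.IntegerType)
        .add("NewColB", DataTypes.StringType);

// For each input row, emit one output row per cell, tagged with its column name.
// Assumes every column of `df` is an integer.
Dataset<Row> df2 = df.flatMap((FlatMapFunction<Row, Row>) row -> {
    List<Row> out = new ArrayList<>();
    for (String col : row.schema().fieldNames()) {
        out.add(RowFactory.create(row.getInt(row.fieldIndex(col)), col));
    }
    return out.iterator();
}, RowEncoder.apply(schema));

Unlike stack(), this deserializes every row into JVM objects before re-emitting it, which is part of why it tends to be slower on large inputs.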

Upvotes: 1
