Arindam Ghosh
Arindam Ghosh

Reputation: 11

Explode multiple columns into separate rows in Spark Scala

I have a DF in the following structure

Col1.                       Col2                    Col3
Data1Col1,Data2Col1.     Data1Col2,Data2Col2.    Data1Col3,Data2Col3

I want the resultant dataset to be of the following type:

Col1         Col2        Col3
Data1Col1.  Data1Col2.   Data1Col3
Data2Col1.  Data2Col2    Data2Col3

Please suggest me how to approach this. I have tried explode , but that results in duplicate rows.

Upvotes: 1

Views: 648

Answers (1)

Rohan Gupta
Rohan Gupta

Reputation: 94

val df = Seq(("C,D,E,F","M,N,O,P","K,P,B,P")).toDF("Col1","Col2","Col3") 
   
df.show
+-------+-------+-------+
|   Col1|   Col2|   Col3|
+-------+-------+-------+
|C,D,E,F|M,N,O,P|K,P,B,P|
+-------+-------+-------+
           
val res1 = df.withColumn("Col1",split(col("Col1"),",")).withColumn("Col2",split(col("Col2"),",")).withColumn("Col3",split(col("Col3"),","))
           
res1.show
+------------+------------+------------+
|        Col1|        Col2|        Col3|
+------------+------------+------------+
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|
+------------+------------+------------+
           
           
val zip = udf((x: Seq[String], y: Seq[String], z: Seq[String]) => z.zip(x.zip(y)))
           
val res14 = res1.withColumn("test",explode(zip(col("Col1"),col("Col2"),col("Col3")))).show
+------------+------------+------------+-----------+
|        Col1|        Col2|        Col3|       test|
+------------+------------+------------+-----------+
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[K, [C, M]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[P, [D, N]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[B, [E, O]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[P, [F, P]]|
+------------+------------+------------+-----------+
           
       
res14.withColumn("t3",col("test._1")).withColumn("tn",col("test._2")).withColumn("t2",col("tn._2")).withColumn("t1",col("tn._1")).select("t1","t2","t3").show
+---+---+---+
| t1| t2| t3|
+---+---+---+
|  C|  M|  K|
|  D|  N|  P|
|  E|  O|  B|
|  F|  P|  P|
+---+---+---+

res1 - Initial Dataframe

res14 - intermediate Df

Upvotes: 1

Related Questions