Reputation: 906
This is a question identical to
Pyspark: Split multiple array columns into rows
but I want to know how to do it in scala
for a dataframe like this,
+---+---------+---------+---+
| a| b| c| d|
+---+---------+---------+---+
| 1|[1, 2, 3]|[, 8, 9] |foo|
+---+---------+---------+---+
I want to have it in following format
+---+---+-------+------+
| a| b| c | d |
+---+---+-------+------+
| 1| 1| None | foo |
| 1| 2| 8 | foo |
| 1| 3| 9 | foo |
+---+---+-------+------+
In scala, I know there's an explode function, but I don't think it's applicable here.
I tried
import org.apache.spark.sql.functions.arrays_zip
but I get an error, saying arrays_zip is not a member of org.apache.spark.sql.functions although it's clearly a function in https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html
Upvotes: 0
Views: 648
Reputation: 2072
the below answer might be helpful to you,
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
val arrayData = Seq(
Row(1,List(1,2,3),List(0,8,9),"foo"))
val arraySchema = new StructType().add("a",IntegerType).add("b", ArrayType(IntegerType)).add("c", ArrayType(IntegerType)).add("d",StringType)
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData),arraySchema)
df.select($"a",$"d",explode($"b",$"c")).show(false)
val zip = udf((x: Seq[Int], y: Seq[Int]) => x.zip(y))
df.withColumn("vars", explode(zip($"b", $"c"))).select($"a", $"d",$"vars._1".alias("b"), $"vars._2".alias("c")).show()
/*
+---+---+---+---+
| a| d| b| c|
+---+---+---+---+
| 1|foo| 1| 0|
| 1|foo| 2| 8|
| 1|foo| 3| 9|
+---+---+---+---+
*/
Upvotes: 1