Reputation: 5110
I have a Spark DataFrame that looks something like this:
+---+------+----+
| id|animal|talk|
+---+------+----+
| 1| bat|done|
| 2| mouse|mone|
| 3| horse| gun|
| 4| horse|some|
+---+------+----+
I want to generate a new column, say merged, which would look something like this:
+---+-----------------------------------------------------------+
| id| merged columns |
+---+-----------------------------------------------------------+
| 1| [{name: animal, value: bat}, {name: talk, value: done}] |
| 2| [{name: animal, value: mouse}, {name: talk, value: mone}] |
| 3| [{name: animal, value: horse}, {name: talk, value: gun}] |
| 4| [{name: animal, value: horse}, {name: talk, value: some}] |
+---+-----------------------------------------------------------+
Basically, I want to combine all the columns into an Array of case class merged(name: String, value: String).
Can anyone help me with how to do this in Scala? For simplicity I have used only two columns here, but a generic answer that works for N columns would help greatly.
Upvotes: 4
Views: 2817
Reputation: 22439
Your expected output doesn't seem to reflect your requirement of producing a list of name-value structured objects. If I understand it correctly, consider using foldLeft to iteratively convert the wanted columns to StructType name-value columns, and group them into an ArrayType column:
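import org.apache.spark.sql.functions._
// Needed for toDF and $ outside spark-shell; assumes a SparkSession named spark
import spark.implicits._
val df = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone"),
  (3, "horse", "gun"),
  (4, "horse", "some")
).toDF("id", "animal", "talk")
// Every column except the key gets merged
val cols = df.columns.filter(_ != "id")
// Replace each wanted column with a (name, value) struct, then collect the structs into one array
val resultDF = cols.
  foldLeft(df)( (accDF, c) =>
    accDF.withColumn(c, struct(lit(c).as("name"), col(c).as("value")))
  ).
  select($"id", array(cols.map(col): _*).as("merged"))
resultDF.show(false)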
// +---+-----------------------------+
// |id |merged |
// +---+-----------------------------+
// |1 |[[animal,bat], [talk,done]] |
// |2 |[[animal,mouse], [talk,mone]]|
// |3 |[[animal,horse], [talk,gun]] |
// |4 |[[animal,horse], [talk,some]]|
// +---+-----------------------------+
resultDF.printSchema
// root
// |-- id: integer (nullable = false)
// |-- merged: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- name: string (nullable = false)
// | | |-- value: string (nullable = true)
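As a variation on the foldLeft approach above, the same array of name-value structs can be built in a single select, without rewriting each column first. This is only a sketch under a few assumptions: the local SparkSession setup and the names merged and singlePassDF are mine and not part of the original answer, and all merged columns are assumed to be strings, since array needs one common element type.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, lit, struct}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("merge-columns-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone"),
  (3, "horse", "gun"),
  (4, "horse", "some")
).toDF("id", "animal", "talk")

// Every column except the key; works for any number of columns.
val cols = df.columns.filter(_ != "id")

// One (name, value) struct per column, wrapped in a single array column.
val merged = array(
  cols.map(c => struct(lit(c).as("name"), col(c).as("value"))): _*
)

// The original columns are never overwritten, so no foldLeft is needed.
val singlePassDF = df.select(col("id"), merged.as("merged"))
```

If the columns have different types, casting each value with col(c).cast("string") inside the struct is one way to give the array a single element type.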
Upvotes: 4