Global Warrior

Reputation: 5110

Spark Scala DataFrame: merging multiple columns into a single column

I have a Spark DataFrame that looks like this:

+---+------+----+
| id|animal|talk|
+---+------+----+
|  1|   bat|done|
|  2| mouse|mone|
|  3| horse| gun|
|  4| horse|some|
+---+------+----+

I want to generate a new column, say merged, which would look something like this:

+---+-----------------------------------------------------------+
| id| merged columns                                            |
+---+-----------------------------------------------------------+
|  1| [{name: animal, value: bat}, {name: talk, value: done}]   |
|  2| [{name: animal, value: mouse}, {name: talk, value: mone}] |
|  3| [{name: animal, value: horse}, {name: talk, value: gun}]  |
|  4| [{name: animal, value: horse}, {name: talk, value: some}] |
+---+-----------------------------------------------------------+

Basically, I want to combine all the columns into an Array of case class merged(name: String, value: String).

Can anyone help me with how to do this in Scala? For simplicity I have used only two columns here, but a generic answer that works for N columns would greatly help.

Upvotes: 4

Views: 2817

Answers (1)

Leo C

Reputation: 22439

Your expected output doesn't quite match your stated requirement of producing a list of name-value structured objects. If I understand the requirement correctly, consider using foldLeft to convert each wanted column, one at a time, into a StructType column with name and value fields, and then group those struct columns into a single ArrayType column:

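import org.apache.spark.sql.functions._
// spark-shell imports spark.implicits._ automatically; in a standalone application,
// add `import spark.implicits._` (needed for toDF and the $"..." syntax below).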

val df = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone"),
  (3, "horse", "gun"),
  (4, "horse", "some")
).toDF("id", "animal", "talk")

val cols = df.columns.filter(_ != "id")

val resultDF = cols.
  foldLeft(df)( (accDF, c) => 
    accDF.withColumn(c, struct(lit(c).as("name"), col(c).as("value")))
  ).
  select($"id", array(cols.map(col): _*).as("merged"))

resultDF.show(false)
// +---+-----------------------------+
// |id |merged                       |
// +---+-----------------------------+
// |1  |[[animal,bat], [talk,done]]  |
// |2  |[[animal,mouse], [talk,mone]]|
// |3  |[[animal,horse], [talk,gun]] |
// |4  |[[animal,horse], [talk,some]]|
// +---+-----------------------------+

resultDF.printSchema
// root
//  |-- id: integer (nullable = false)
//  |-- merged: array (nullable = false)
//  |    |-- element: struct (containsNull = false)
//  |    |    |-- name: string (nullable = false)
//  |    |    |-- value: string (nullable = true)
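The same result can probably be reached in a single select, without rewriting each column through foldLeft. The following is only a sketch under a few assumptions: the helper mergeColumns, the object MergeColumnsSketch and the case class Merged are hypothetical names introduced here for illustration, the local SparkSession setup is just for a standalone run, and the cast to string is an addition so that columns of different types can share one array (the example data above is all strings, so it changes nothing there). Column names containing dots would need extra quoting and are not handled.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, lit, struct}

// Hypothetical case class mirroring the one described in the question.
// It is declared at the top level so that Spark can derive an encoder for it.
case class Merged(name: String, value: String)

object MergeColumnsSketch {

  // Packs every column except `keyCol` into one array of (name, value) structs.
  // Values are cast to string (an assumption of this sketch) so that columns
  // of different types can live in the same array.
  def mergeColumns(df: DataFrame, keyCol: String): DataFrame = {
    val pairs = df.columns.filter(_ != keyCol).map { c =>
      struct(lit(c).as("name"), col(c).cast("string").as("value"))
    }
    df.select(col(keyCol), array(pairs: _*).as("merged"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for a standalone run; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("merge-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "bat", "done"),
      (2, "mouse", "mone")
    ).toDF("id", "animal", "talk")

    val merged = mergeColumns(df, "id")
    merged.show(false)

    // Optional typed view: each row becomes (id, Seq[Merged]).
    // Tuple fields are matched by position, struct fields by name.
    val typed = merged.as[(Int, Seq[Merged])]
    typed.collect().foreach(println)

    spark.stop()
  }
}
```

Because the key column is a parameter and the remaining columns are taken from df.columns, this should work for any number of columns. The typed view is optional and only useful if you want to work with the case class from the question rather than with generic rows.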

Upvotes: 4
