czr_RR

Reputation: 591

Conditional Spark map() function based on input columns

What I'm trying to achieve here is to pass to the Spark SQL map function a set of conditionally generated columns, depending on whether they contain null, 0, or any other value I may want to exclude.

Take for example this initial DF.

// java.lang.Integer so the third field can hold null
val initialDF = Seq[(String, String, Integer)](
  ("a", "b", 1),
  ("a", "b", null),
  ("a", null, 0)
).toDF("field1", "field2", "field3")

From that initial DataFrame I want to generate yet another column which will be a map, like this.

initialDF.withColumn("thisMap", MY_FUNCTION)

My current approach is basically to take a Seq[String] in a method and flatMap it into the key-value pairs that the Spark SQL map function receives, like this.

def toMap(columns: String*): Column = {
  map(
    columns.flatMap(column => List(lit(column), col(column))): _*
  )
}
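
A call to it looks roughly like this (just a sketch):

// Every column always ends up in the resulting map, even when its value is null or 0
val withMap = initialDF.withColumn("thisMap", toMap("field1", "field2", "field3"))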

But then the filtering has to happen on the Scala side and becomes quite a mess, something along the lines of the sketch below.
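
(The stripNullOrZero UDF here is only a hypothetical illustration of that Scala-side filtering, not what I actually want to end up with.)

import org.apache.spark.sql.functions.udf

// Build the full map first, then strip the null / "0" entries per row on the Scala side
val stripNullOrZero = udf { m: Map[String, String] =>
  m.filter { case (_, v) => v != null && v != "0" }
}

initialDF.withColumn("thisMap", stripNullOrZero(toMap("field1", "field2", "field3")))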

What I would like to obtain after the processing, for each of those rows, is the following DataFrame.

val initialDF = Seq(
  ("a", "b", 1, Map("field1" -> "a", "field2" -> "b", "field3" -> 1)),
  ("a", "b", null, Map("field1" -> "a", "field2" -> "b")),
  ("a", null, 0, Map("field1" -> "a"))
)
  .toDF("field1", "field2", "field3", "thisMap")

I was wondering if this can be achieved using the Column API, which is much more intuitive, with .isNull or .equalTo?

Upvotes: 1

Views: 2810

Answers (2)

Fqp

Reputation: 143

Here's a small improvement on Lamanus' answer above which only loops over df.columns once:

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

case class Record(field1: String, field2: String, field3: java.lang.Integer)

val df = Seq(
  Record("a", "b", 1),
  Record("a", "b", null),
  Record("a", null, 0)
).toDS

df.show

// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// |     a|     b|     1|
// |     a|     b|  null|
// |     a|  null|     0|
// +------+------+------+

// One single-entry map per column (or an empty map when the value is null or 0), merged with map_concat
df.withColumn("thisMap", map_concat(
  df.columns.map { colName =>
    when(col(colName).isNull or col(colName) === 0, map())
      .otherwise(map(lit(colName), col(colName)))
  }: _*
)).show(false)

// +------+------+------+---------------------------------------+
// |field1|field2|field3|thisMap                                |
// +------+------+------+---------------------------------------+
// |a     |b     |1     |[field1 -> a, field2 -> b, field3 -> 1]|
// |a     |b     |null  |[field1 -> a, field2 -> b]             |
// |a     |null  |0     |[field1 -> a]                          |
// +------+------+------+---------------------------------------+

Upvotes: 2

Lamanus

Reputation: 13581

UPDATE

I found a way to achieve the expected result but it is a bit dirty.

// One single-entry map per column; keep it only when its value is neither null nor 0, then merge them
val df2 = df.columns.foldLeft(df) { (acc, n) => acc.withColumn(n + "_map", map(lit(n), col(n))) }
val col_cond = df.columns.map { n =>
  when(not(col(n + "_map").getItem(n).isNull || col(n + "_map").getItem(n) === lit("0")), col(n + "_map"))
    .otherwise(map())
}
df2.withColumn("map", map_concat(col_cond: _*)).show(false)
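
To be left with only the original columns plus the final map, the intermediate *_map helper columns can be dropped afterwards (a small addition to the snippet above):

// Drop the *_map helper columns once the merged map has been built
df2.withColumn("map", map_concat(col_cond: _*))
  .drop(df.columns.map(_ + "_map"): _*)
  .show(false)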

ORIGINAL

Here is my attempt with the map_from_arrays function, which is available in Spark 2.4+.

df.withColumn("array", array(df.columns.map(col): _*))
  .withColumn("map", map_from_arrays(lit(df.columns), $"array")).show(false)

Then, the result is:

+------+------+------+---------+---------------------------------------+
|field1|field2|field3|array    |map                                    |
+------+------+------+---------+---------------------------------------+
|a     |b     |1     |[a, b, 1]|[field1 -> a, field2 -> b, field3 -> 1]|
|a     |b     |null  |[a, b,]  |[field1 -> a, field2 -> b, field3 ->]  |
|a     |null  |0     |[a,, 0]  |[field1 -> a, field2 ->, field3 -> 0]  |
+------+------+------+---------+---------------------------------------+

Upvotes: 2
