Wolf Rendall
Wolf Rendall

Reputation: 417

Replace Values in dataframe conditionally when key appears in a map

I have two objects in a scala spark environment. The first is a dataframe with many columns of type String. The second is a Map with column_names as keys and the value of each key is another Map which defines replacement values.

I want to iterate over the Map of Maps, match the column_name key to the dataframe key, then use the Map of replacement values to replace the values. If the value is not present in the Map of replacement values, then I want to return the column's original value.

What's the best way to accomplish this:

val replacements = Map("Col_1" -> Map("Male" -> "0", "Female" -> "1"),
                       "Col_2" -> Map("2" -> "1", "3" -> "1", "4" -> "1")

val df = Seq(("Mary", "Female", "0"), ("John", "Male", "2"), ("Luke", "Male", "1")).toDF(List("name", "Col_1", "Col_2").toSeq:_*)

df.show
+----+------+-----+
|name| Col_1|Col_2|
+----+------+-----+
|Mary|Female|    0|
|John|  Male|    2|
|Luke|  Male|    1|
+----+------+-----+

//I hoped to do something like this:

val fixed_df = replacements.foldLeft(df) { 
case (tmpDF, (colName, foo: Map[key, v])) => tmpDF.withColumn(colName, when($"$colName" isin(foo.keys.to[List]:_*), foo($"$colName")).otherwise($"$colName"))
    }

//to get output like this:
fixed_df.take(3)
+----+------+-----+
|name| Col_1|Col_2|
+----+------+-----+
|Mary|     1|    0|
|John|     0|    1|
|Luke|     0|    1|
+----+------+-----+

Note, thanks to @tzachz for help with a similar, different solution.

Upvotes: 0

Views: 1321

Answers (1)

Alper t. Turker
Alper t. Turker

Reputation: 35249

All you need is na.replace:

replacements.keys.foldLeft(df)(
  (acc, key) => acc.na.replace(key, replacements(key))
).show

// +----+-----+-----+
// |name|Col_1|Col_2|
// +----+-----+-----+
// |Mary|    1|    0|
// |John|    0|    1|
// |Luke|    0|    1|
// +----+-----+-----+

Upvotes: 1

Related Questions