Reputation: 417
I have two objects in a scala spark environment. The first is a dataframe with many columns of type String. The second is a Map with column_names as keys and the value of each key is another Map which defines replacement values.
I want to iterate over the Map of Maps, match the column_name key to the dataframe key, then use the Map of replacement values to replace the values. If the value is not present in the Map of replacement values, then I want to return the column's original value.
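To pin down the fallback behavior I'm after, here is a minimal Spark-free sketch in plain Scala (the helper name `replaceValue` is just for illustration): look the cell value up in the per-column replacement map and fall back to the original value when there is no mapping.

```scala
object ReplacementSketch {
  // Same shape as the replacements Map below: column name -> (old value -> new value)
  val replacements: Map[String, Map[String, String]] = Map(
    "Col_1" -> Map("Male" -> "0", "Female" -> "1"),
    "Col_2" -> Map("2" -> "1", "3" -> "1", "4" -> "1")
  )

  // Replace `value` if a mapping exists for this column and value;
  // otherwise return the original value unchanged.
  def replaceValue(colName: String, value: String): String =
    replacements.get(colName) match {
      case Some(mapping) => mapping.getOrElse(value, value) // unmapped value: keep original
      case None          => value                           // column has no replacement map
    }
}
```

For example, `ReplacementSketch.replaceValue("Col_1", "Female")` yields `"1"`, while `ReplacementSketch.replaceValue("Col_2", "0")` returns `"0"` unchanged.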
What's the best way to accomplish this:
val replacements = Map(
  "Col_1" -> Map("Male" -> "0", "Female" -> "1"),
  "Col_2" -> Map("2" -> "1", "3" -> "1", "4" -> "1")
)
val df = Seq(("Mary", "Female", "0"), ("John", "Male", "2"), ("Luke", "Male", "1")).toDF("name", "Col_1", "Col_2")
df.show
+----+------+-----+
|name| Col_1|Col_2|
+----+------+-----+
|Mary|Female|    0|
|John|  Male|    2|
|Luke|  Male|    1|
+----+------+-----+
//I hoped to do something like this (it doesn't compile -- a Scala Map
//can't be applied to a Column):
val fixed_df = replacements.foldLeft(df) {
  case (tmpDF, (colName, replMap)) =>
    tmpDF.withColumn(colName,
      when($"$colName".isin(replMap.keys.toList: _*), replMap($"$colName"))
        .otherwise($"$colName"))
}
//to get output like this:
fixed_df.show
+----+-----+-----+
|name|Col_1|Col_2|
+----+-----+-----+
|Mary|    1|    0|
|John|    0|    1|
|Luke|    0|    1|
+----+-----+-----+
Note: thanks to @tzachz for help with a similar but different solution.
Upvotes: 0
Views: 1321
Reputation: 35249
All you need is na.replace, which replaces the listed values and leaves any unmatched values untouched (exactly the fallback behavior you want):
replacements.foldLeft(df) {
  case (acc, (colName, replMap)) => acc.na.replace(colName, replMap)
}.show
// +----+-----+-----+
// |name|Col_1|Col_2|
// +----+-----+-----+
// |Mary| 1| 0|
// |John| 0| 1|
// |Luke| 0| 1|
// +----+-----+-----+
Upvotes: 1