Reputation: 7411
What I want here is to replace a value in a specific column to null if it's empty String.
The reason is I am using org.apache.spark.sql.functions.coalesce
to fill one of the Dataframe's column based on another columns, but I have noticed in some rows the value is empty String
instead of null
so the coalesce
function doesn't work as expected.
val myCoalesceColumnorder: Seq[String] = Seq("xx", "yy", "zz"),
val resolvedDf = df.select(
df("a"),
df("b"),
lower(org.apache.spark.sql.functions.coalesce(myCoalesceColumnorder.map(x => adjust(x)): _*)).as("resolved_id")
)
In the above example, I expected to first fill resolved_id
with column xx
if it' not null and if it's null with column yy
and so on. But since sometime column xx
is filled with ""
instead of null I get ""
in 'resolved_id'.
I have tried to fix it with
resolvedDf.na.replace("resolved_id", Map("" -> null))
But based on the na.replace
documentation it only works if both key and value are either Bolean
or String
or Double
so I can not use null
here.
I don't want to use UDF
because of the performance issue, I just want to know is there any other trick to solve this issue?
One other way I can fix this is by using when
but not sure about the performance
resolvedDf
.withColumn("resolved_id", when(col("resolved_id").equalTo(""), null).otherwise(col("resolved_id")))
Upvotes: 2
Views: 10814
Reputation: 2556
This is the right way with better performance
resolvedDf.withColumn("resolved_id", when($"resolved_id" =!= "", $"resolved_id"))
Basically no need to use otherwise
method.
You can check sources::: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L507
/**
* Evaluates a list of conditions and returns one of multiple possible result expressions.
* If otherwise is not defined at the end, null is returned for unmatched conditions.
*
* {{{
* // Example: encoding gender string column into integer.
*
* // Scala:
* people.select(when(people("gender") === "male", 0)
* .when(people("gender") === "female", 1)
* .otherwise(2))
*
* // Java:
* people.select(when(col("gender").equalTo("male"), 0)
* .when(col("gender").equalTo("female"), 1)
* .otherwise(2))
* }}}
*
* @group expr_ops
* @since 1.4.0
*/
def when(condition: Column, value: Any): Column = this.expr match {
case CaseWhen(branches, None) =>
withExpr { CaseWhen(branches :+ ((condition.expr, lit(value).expr))) }
case CaseWhen(branches, Some(_)) =>
throw new IllegalArgumentException(
"when() cannot be applied once otherwise() is applied")
case _ =>
throw new IllegalArgumentException(
"when() can only be applied on a Column previously generated by when() function")
}
Upvotes: 2