Reputation: 6085
I have a dataframe containing float and double values.
scala> val df = List((Float.NaN, Double.NaN), (1f, 0d)).toDF("x", "y")
df: org.apache.spark.sql.DataFrame = [x: float, y: double]
scala> df.show
+---+---+
| x| y|
+---+---+
|NaN|NaN|
|1.0|0.0|
+---+---+
scala> df.printSchema
root
|-- x: float (nullable = false)
|-- y: double (nullable = false)
When I replace the NaN values with null, I pass "null" as a String to the Map used in the fill operation.
scala> val map = df.columns.map((_, "null")).toMap
map: scala.collection.immutable.Map[String,String] = Map(x -> null, y -> null)
scala> df.na.fill(map).printSchema
root
|-- x: float (nullable = true)
|-- y: double (nullable = true)
scala> df.na.fill(map).show
+----+----+
| x| y|
+----+----+
|null|null|
| 1.0| 0.0|
+----+----+
And I got the correct result. But I was not able to understand how/why Spark SQL translates "null" as a String into a null object?
Upvotes: 4
Views: 3449
Reputation: 23109
If you look into the fill function behind df.na, it checks the datatype and tries to convert the value to the datatype of the column in the schema. If it can be converted it converts; otherwise it returns null.
It does not convert the String "null" to the object null; it simply returns null when an exception occurs while converting.
val map = df.columns.map((_, "WHATEVER")).toMap
gives null
and val map = df.columns.map((_, "9999.99")).toMap
gives 9999.99
If you want to update the NAN
with same datatype, you can get result as expected.
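A minimal sketch of the two cases above, assuming the df from the question:
df.na.fill(df.columns.map((_, "WHATEVER")).toMap).show()  // cast fails, NaNs become null
df.na.fill(df.columns.map((_, "9999.99")).toMap).show()   // cast succeeds, NaNs become 9999.99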
Hope this helps you to understand!
Upvotes: 3
Reputation: 27373
I've looked into the source code; in fill your String is cast to a double/float:
private def fillCol[T](col: StructField, replacement: T): Column = {
  col.dataType match {
    case DoubleType | FloatType =>
      coalesce(nanvl(df.col("`" + col.name + "`"), lit(null)),
        lit(replacement).cast(col.dataType)).as(col.name)
    case _ =>
      coalesce(df.col("`" + col.name + "`"), lit(replacement).cast(col.dataType)).as(col.name)
  }
}
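As a minimal sketch, assuming the df from the question, the expression fillCol builds for the double column y can be reproduced by hand; nanvl turns the NaN into null, and coalesce then substitutes the replacement cast to the column's type:
import org.apache.spark.sql.functions.{coalesce, lit, nanvl}
df.select(
  coalesce(nanvl(df("y"), lit(null)), lit("null").cast("double")).as("y")
).show()
// the cast of "null" to double fails, so the NaN row stays null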
The relevant source code for the cast is in Cast.scala (taken from Spark 1.6.3); the code for floats is similar:
// DoubleConverter
private[this] def castToDouble(from: DataType): Any => Any = from match {
  case StringType =>
    buildCast[UTF8String](_, s => try s.toString.toDouble catch {
      case _: NumberFormatException => null
    })
  case BooleanType =>
    buildCast[Boolean](_, b => if (b) 1d else 0d)
  case DateType =>
    buildCast[Int](_, d => null)
  case TimestampType =>
    buildCast[Long](_, t => timestampToDouble(t))
  case x: NumericType =>
    b => x.numeric.asInstanceOf[Numeric[Any]].toDouble(b)
}
So Spark tries to convert the String to a Double (s.toString.toDouble); if that's not possible (i.e. you get a NumberFormatException) you get a null. So instead of "null" you could also use "foo" with the same outcome. But if you use "1.0" in your map, then NaNs and nulls will be replaced with 1.0, because the String "1.0" is indeed parsable to a Double.
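A minimal sketch of that cast behaviour, assuming a SparkSession named spark:
import org.apache.spark.sql.functions.lit
spark.range(1).select(
  lit("foo").cast("double").as("foo_as_double"), // NumberFormatException -> null
  lit("1.0").cast("double").as("one_as_double")  // parses -> 1.0
).show()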
Upvotes: 1
Reputation: 28322
It's not that "null" as a String translates to a null
object. You can try using the transformation with any String and still get null
(with the exception of strings that can directly be cast to double/float, see below). For example, using
val map = df.columns.map((_, "abc")).toMap
would give the same result. My guess is that as the columns are of type float
and double
converting the NaN
values to a string will give null
. Using a number instead would work as expected, e.g.
val map = df.columns.map((_, 1)).toMap
As some Strings can directly be cast to double or float, those are also possible to use in this case:
val map = df.columns.map((_, "1")).toMap
Upvotes: 1