himanshuIIITian

Reputation: 6085

Why Spark SQL translates String "null" to Object null for Float/Double types?

I have a dataframe containing float and double values.

scala> val df = List((Float.NaN, Double.NaN), (1f, 0d)).toDF("x", "y")
df: org.apache.spark.sql.DataFrame = [x: float, y: double]

scala> df.show
+---+---+
|  x|  y|
+---+---+
|NaN|NaN|
|1.0|0.0|
+---+---+

scala> df.printSchema
root
 |-- x: float (nullable = false)
 |-- y: double (nullable = false)

When I replaced the NaN values with null, I passed "null" as a String value in the Map used for the fill operation.

scala> val map = df.columns.map((_, "null")).toMap
map: scala.collection.immutable.Map[String,String] = Map(x -> null, y -> null)

scala> df.na.fill(map).printSchema
root
 |-- x: float (nullable = true)
 |-- y: double (nullable = true)


scala> df.na.fill(map).show
+----+----+
|   x|   y|
+----+----+
|null|null|
| 1.0| 0.0|
+----+----+

And I got the correct result. But I was not able to understand how/why Spark SQL translates the String "null" into a null object.

Upvotes: 4

Views: 3449

Answers (3)

koiralo

Reputation: 23109

If you look into the fill function in Dataset, it checks the datatype of the replacement value and tries to convert it to the datatype of the column's schema. If it can be converted, it uses the converted value; otherwise it returns null.

It does not convert the String "null" to the object null; it returns null because an exception occurs while converting.

val map = df.columns.map((_, "WHATEVER")).toMap

gives null

and val map = df.columns.map((_, "9999.99")).toMap

gives 9999.99

If you update the NaN values with a replacement of the same datatype, you get the result as expected.
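For instance, a minimal sketch (assuming the df from the question; the replacement value 0 is just an illustration):

// Fill NaN with values of the columns' own numeric types,
// so no String-to-number cast is involved.
val numericMap: Map[String, Any] = Map("x" -> 0f, "y" -> 0d)

df.na.fill(numericMap).show()
// NaN in x and y should be replaced with 0.0 instead of null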

Hope this helps you to understand!

Upvotes: 3

Raphael Roth

Reputation: 27373

I've looked into the source code; in fill your String is cast to a double/float:

private def fillCol[T](col: StructField, replacement: T): Column = {
    col.dataType match {
      case DoubleType | FloatType =>
        coalesce(nanvl(df.col("`" + col.name + "`"), lit(null)),
          lit(replacement).cast(col.dataType)).as(col.name)
      case _ =>
        coalesce(df.col("`" + col.name + "`"), lit(replacement).cast(col.dataType)).as(col.name)
    }
  }

The relevant source code for casting is here (similar code for Floats):

Cast.scala (taken from Spark 1.6.3):

  // DoubleConverter
  private[this] def castToDouble(from: DataType): Any => Any = from match {
    case StringType =>
      buildCast[UTF8String](_, s => try s.toString.toDouble catch {
        case _: NumberFormatException => null
      })
    case BooleanType =>
      buildCast[Boolean](_, b => if (b) 1d else 0d)
    case DateType =>
      buildCast[Int](_, d => null)
    case TimestampType =>
      buildCast[Long](_, t => timestampToDouble(t))
    case x: NumericType =>
      b => x.numeric.asInstanceOf[Numeric[Any]].toDouble(b)
  }

So Spark tries to convert the String to a Double (s.toString.toDouble); if that's not possible (i.e. you get a NumberFormatException), you get a null. So instead of "null" you could also use "foo" with the same outcome. But if you use "1.0" in your map, then NaNs and nulls will be replaced with 1.0 because the String "1.0" is indeed parsable to a Double.
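A quick check against the df from the question (a sketch; "foo" is just an arbitrary unparsable String):

// An unparsable String is cast to null, so NaN ends up as null after the coalesce.
df.na.fill(df.columns.map((_, "foo")).toMap).show()
// expected:
// |null|null|
// | 1.0| 0.0|

// A parsable String is cast to the column type, so NaN is replaced by that number.
df.na.fill(df.columns.map((_, "1.0")).toMap).show()
// expected:
// | 1.0| 1.0|
// | 1.0| 0.0|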

Upvotes: 1

Shaido

Reputation: 28322

It's not that "null" as a String translates to a null object. You can try using the transformation with any String and still get null (with the exception of strings that can directly be cast to double/float, see below). For example, using

val map = df.columns.map((_, "abc")).toMap

would give the same result. My guess is that since the columns are of type float and double, a String replacement that cannot be cast to those types gives null. Using a number instead works as expected, e.g.

val map = df.columns.map((_, 1)).toMap

As some Strings can directly be cast to double or float, those can also be used in this case.

val map = df.columns.map((_, "1")).toMap

Upvotes: 1
