tribbloid

Reputation: 3858

Strange typecast error in sparkSQL.createDataFrame

The following code fails:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

val RDD = sparkContext.parallelize(Seq(
  Row("123", new java.lang.Integer(456))
))
val schema = StructType(
  StructField("str", StringType) ::
  StructField("dbl", DoubleType) :: Nil
)
val df = sqlContext.createDataFrame(RDD, schema)
df.collect().foreach(println)

With this exception:

java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
    at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getDouble(rows.scala:44)
    at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDouble(rows.scala:221)
    ....

Please note that this is just a test case; in the real case the java.lang.Integer is returned by another function, so I cannot create the Row with an unboxed numeric value from scratch.

How can I solve this problem?

Upvotes: 2

Views: 5310

Answers (2)

Tzach Zohar

Reputation: 37852

You can convert the Ints to Doubles before creating the DataFrame:

val newRdd = RDD.map { case Row(str, i: java.lang.Integer) => Row(str, i.toDouble) }
val df = sqlContext.createDataFrame(newRdd, schema)
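If the integer column can contain nulls, the single-case pattern above will throw a MatchError. A null-tolerant variant of the same mapping (a sketch, reusing the RDD and schema names from the question; it relies on the "dbl" field keeping its default nullable = true) might look like:

```scala
// Null-safe variant: pass nulls through instead of failing the pattern match.
val newRdd = RDD.map {
  case Row(str, i: java.lang.Integer) => Row(str, i.toDouble)
  case Row(str, null)                 => Row(str, null)
}
val df = sqlContext.createDataFrame(newRdd, schema)
```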

Upvotes: 0

Reactormonk

Reputation: 21740

An Integer is not a Double, and Spark is correct in complaining. Typecast manually:

import org.apache.spark.sql.functions.udf

val toDouble = udf { x: Int => x.toDouble }
df.withColumn("dbl", toDouble(df.col("ints")))
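As an alternative to a UDF, Spark's built-in Column.cast performs the same conversion without a round trip through Scala code (a sketch; "ints" is the column name assumed in the answer above, not one from the question's schema):

```scala
import org.apache.spark.sql.types.DoubleType

// Equivalent without a UDF: let Catalyst cast the column natively.
df.withColumn("dbl", df.col("ints").cast(DoubleType))
```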

Upvotes: 4
