Reputation: 1868
I have previously written a Dataset[T] to a csv file.
In this case T is a case class that contains field x: Option[BigDecimal]
When I attempt to load the file back into a Dataset[T] I see the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `x` from double to decimal(38,18) as it may truncate.
I guess the reason is that the inferred schema contains a double column rather than a BigDecimal column. Is there a way around this issue? I wish to avoid casting based on column name, because the read code is part of a generic function. My read code is below:
val a = spark
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(file)
  .as[T]
My case classes reflect tables read from JDBC, with Option[T] used to represent a nullable field; Option[BigDecimal] is used to receive a DECIMAL field from JDBC.
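For example (a made-up class, not one of my real tables), a nullable DECIMAL column maps like this:

// Hypothetical example: amount maps to a nullable DECIMAL column,
// name to a nullable VARCHAR column.
case class Account(id: Long, name: Option[String], amount: Option[BigDecimal])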
I have pimped on some code to read/write CSV files when I am working on my local machine, so that I can easily inspect the contents.
So my next attempt was this:
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Read with the schema derived from the case class rather than inferring it
var df = spark
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(implicitly[Encoder[T]].schema)
  .load(file)

// Cast any Double columns to DecimalType(38,18)
val schema = df.schema
schema.foreach { field =>
  field.dataType match {
    case t: DoubleType =>
      df = df.withColumn(field.name, col(field.name).cast(DecimalType(38, 18)))
    case _ => // do nothing
  }
}

df.as[T]
Unfortunately my case class now contains all Nones rather than the expected values. If I just load the CSV as a DataFrame with inferred types, all of the column values are correctly populated.
It looks like I actually have two issues.
Any help/advice would be gratefully received. I'm happy to adjust my approach if easily writing/reading Options/BigDecimals to/from CSV files is problematic.
Upvotes: 4
Views: 1475
Reputation: 7336
First I would fill the null values with dfB.na.fill(0.0), and then I would try the following solution:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType
import spark.implicits._

case class MyCaseClass(id: String, cost: Option[BigDecimal])

val dfB = spark.createDataset(Seq(
  ("a", Option(12.45)),
  ("b", Option.empty[Double]), // a missing value (null in the DataFrame)
  ("c", Option(123.33)),
  ("d", Option(1.3444))
)).toDF("id", "cost")

dfB
  .na.fill(0.0)
  .withColumn("cost", col("cost").cast(DecimalType(38, 18)))
  .as[MyCaseClass]
  .show()
First cast the cost column to DecimalType(38,18) explicitly, then retrieve the Dataset[MyCaseClass]. I believe the issue here is that Spark can't convert a double to a BigDecimal without the precision and scale being specified explicitly, so you first need to cast it to a specific decimal type and then use it as a BigDecimal.
UPDATE: I slightly modified the previous code to make it possible to handle members of type Option[BigDecimal] as well.
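Since your read code is generic, you could also drive the casts from the case class encoder's schema instead of naming columns. A rough sketch (readCsv is just a hypothetical helper name, and it assumes an implicit Encoder[T] is in scope):

import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Read with inferred types (so doubles are populated correctly), then cast
// every column whose target type in the case class is a DecimalType.
def readCsv[T: Encoder](spark: SparkSession, file: String): Dataset[T] = {
  val target = implicitly[Encoder[T]].schema
  val raw: DataFrame = spark.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(file)
  val casted = target.fields.foldLeft(raw) { (df, field) =>
    field.dataType match {
      case d: DecimalType => df.withColumn(field.name, col(field.name).cast(d))
      case _              => df
    }
  }
  casted.as[T]
}

This keeps the inferred double values (which you say are populated correctly) and only casts the columns that the case class expects to be decimals.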
Good luck
Upvotes: 1