Spark SQL - Read csv into Dataset[T] where T is a case class of Option[BigDecimal] field

Question

I have previously written a Dataset[T] to a csv file.

In this case T is a case class that contains field x: Option[BigDecimal]

When I attempt to load the file back into a Dataset[T] I see the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `x` from double to decimal(38,18) as it may truncate.

I guess the reason is that the inferred schema contains a double rather than BigDecimal column. Is there a way around this issue? I wish to avoid casting based on column name because the read code is part of a generic function. My read code is below:

   val a = spark
    .read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(file)
    .as[T]

My case classes reflect tables read from JDBC with Option[T] used to represent a nullable field. Option[BigDecimal] is used to receive a Decimal field from JDBC.

I have pimped on some code to read/write from/to csv files when reading/writing on my local machine so I can easily inspect the contents.

So my next attempt was this:

   var df = spark
    .read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .schema(implicitly[Encoder[T]].schema)
    .load(file)

  val schema = df.schema

  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types._

  schema.foreach{ field =>
    field.dataType match {
      case t: DoubleType =>
        df = df.withColumn(field.name, 
          col(field.name).cast(DecimalType(38,18)))
      case _ => // do nothing
    }
  }

  df.as[T]

Unfortunately my case class now contains all Nones rather than the values expected. If I just load the csv as a DF with inferred types all of the column values are correctly populated.

It looks like I actually have two issues.

Conversion from Double -> BigDecimal.
Nullable fields are not being wrapped in Options.

Any help/advice would be gratefully received. Happy to adjust my approach if easily writing/reading Options/BigDecimals from csv files is problematic.

Spark SQL - Read csv into Dataset[T] where T is a case class of Option[BigDecimal] field

Answers (1)

Related Questions