Reputation: 3247
I have a case class
final case class FieldStateData(
job_id: String = null,
job_base_step_id: String = null,
field_id: String = null,
data_id: String = null,
data_value: String = null,
executed_unit: String = null,
is_doc: Boolean = null,
mime_type: String = null,
filename: String = null,
filesize: BigInt = null,
caption: String = null,
executor_id: String = null,
executor_name: String = null,
executor_email: String = null,
created_at: BigInt = null
)
That I want to use as part of a dataset of type Dataset[FieldStateData] to eventually insert into a database. All columns need to be nullable. How would I represent null for the value types (those descended from AnyVal, like Boolean and BigInt) rather than for the strings? I thought about using Option[Boolean] or something like that, but will that automatically unbox during insertion or when it's used in a SQL query?
Also note that the above code is not correct: Boolean fields cannot be assigned null. It's just an example.
Upvotes: 2
Views: 503
Reputation: 1631
You are correct to use the Option monad in the case class. The fields will be unboxed by Spark on read.
import org.apache.spark.sql.{Encoder, Encoders, Dataset}
final case class FieldStateData(job_id: Option[String],
job_base_step_id: Option[String],
field_id: Option[String],
data_id: Option[String],
data_value: Option[String],
executed_unit: Option[String],
is_doc: Option[Boolean],
mime_type: Option[String],
filename: Option[String],
filesize: Option[BigInt],
caption: Option[String],
executor_id: Option[String],
executor_name: Option[String],
executor_email: Option[String],
created_at: Option[BigInt])
implicit val fieldCodec: Encoder[FieldStateData] = Encoders.product[FieldStateData]
val ds: Dataset[FieldStateData] = spark.read.load(sourcePath).as[FieldStateData] // sourcePath: wherever your data lives
When you write the Dataset back into the database, None becomes a null value and Some(x) becomes the value x.
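To illustrate that None-to-null mapping without a running Spark session, here is a minimal plain-Scala sketch. The FileRow case class is a hypothetical two-field analogue of FieldStateData; Option.orNull shows the same unwrapping convention that Spark's encoders apply to reference-typed fields on write:

```scala
// Hypothetical two-field analogue of FieldStateData
final case class FileRow(filename: Option[String], filesize: Option[BigInt])

object OptionDemo {
  def main(args: Array[String]): Unit = {
    val present = FileRow(Some("report.pdf"), Some(BigInt(2048)))
    val absent  = FileRow(None, None)

    // On write, Spark unwraps Some(x) to x and turns None into SQL NULL.
    // Option.orNull demonstrates the same mapping for reference types:
    println(present.filename.orNull) // prints "report.pdf"
    println(absent.filename.orNull)  // prints "null"
  }
}
```

The same pattern applies on read: a NULL column lands in the case class as None, so downstream code pattern-matches on Option instead of null-checking.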
Upvotes: 2