Ged

Reputation: 18108

Spark Scala Int vs Integer for Option vs StructType

Why is that for a case class I can do

fieldn: Option[Int]

or

fieldn: Option[Integer]

but for a StructType I must use the following?

StructField("fieldn", IntegerType, true)
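For reference, here is a minimal sketch of the two forms side by side (Record is just a placeholder name):

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Case class form: nullability is expressed through Option on the field's type
case class Record(fieldn: Option[Int])

// StructType form: the type is a Spark SQL DataType object and
// nullability is a separate Boolean flag
val schema = StructType(Seq(
  StructField("fieldn", IntegerType, nullable = true)
))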

Upvotes: 0

Views: 1681

Answers (2)

Kit Menke

Reputation: 7056

I understand why it seems inconsistent - the reason is convenience. It is simply more convenient to give Spark a case class, because case classes are easy to work with in Scala.

Behind the scenes, Spark takes the case class you give it and uses it to determine the schema for your DataFrame, converting all Java/Scala types to Spark SQL types. For example, for the following case class:

case class TestIntConversion(javaInteger: java.lang.Integer, scalaInt: scala.Int, scalaOptionalInt: Option[scala.Int])

You get a schema like this:

root
 |-- javaInteger: integer (nullable = true)
 |-- scalaInt: integer (nullable = false)
 |-- scalaOptionalInt: integer (nullable = true)

In recent versions of Spark, the component that does this conversion for you is an Encoder. You can see many of these conversions exercised in ExpressionEncoderSuite.
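As a rough illustration (a sketch, not the exact internals), you can ask Spark for the derived schema directly through Encoders.product and print the same tree:

import org.apache.spark.sql.Encoders

case class TestIntConversion(javaInteger: java.lang.Integer, scalaInt: scala.Int, scalaOptionalInt: Option[scala.Int])

// Encoders.product derives an encoder for the case class; its schema
// shows which Spark SQL type each JVM type was mapped to.
val schema = Encoders.product[TestIntConversion].schema
schema.printTreeString() // prints the root |-- ... tree shown above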

Upvotes: 2

user11162301

Reputation: 11

An Option type denotes a value that may be undefined (None), so it applies to data. There is no position at which it could be meaningfully used in your StructField example.

A schema must be fully defined, so

Option[StructField]

would provide no information about the field's type, nor would it be semantically truthful. Anything along the lines of

Option[DataType]

or

Option[IntegerType]

i.e.

StructField("fieldn", Some(IntegerType))

would make even less sense - either creating an object with unclear semantics (the former) or an impossible API (the latter).

Fundamentally, a StructType represents obligatory metadata. It cannot be missing by design, and because of that, Option has no place there.
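To make that concrete, a minimal sketch (field names are illustrative): in a schema, nullability is a Boolean flag on each field, never an Option around the type:

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// The schema itself is always fully defined; only the data may be null.
val schema = StructType(Seq(
  StructField("fieldn", IntegerType, nullable = true),    // values may be null
  StructField("required", IntegerType, nullable = false)  // values must not be null
))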

Upvotes: 1
