David Griffin

Reputation: 13927

Generate a Spark StructType / Schema from a case class

If I wanted to create a StructType (i.e. a DataFrame.schema) out of a case class, is there a way to do it without creating a DataFrame? I can easily do:

case class TestCase(id: Long)
val schema = Seq[TestCase]().toDF.schema

But it seems overkill to actually create a DataFrame when all I want is the schema.

(If you are curious, the reason behind the question is that I am defining a UserDefinedAggregateFunction, and to do so you override a couple of methods that return StructTypes and I use case classes.)
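For concreteness, here's a minimal sketch of that situation (hypothetical names; inputSchema and bufferSchema are the StructType-returning methods I'd like to derive from case classes):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// A trivial long-sum UDAF; the two schema methods are the boilerplate in question
class LongSum extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getLong(0) + input.getLong(0)
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
    b1(0) = b1.getLong(0) + b2.getLong(0)
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}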

Upvotes: 59

Views: 32535

Answers (5)

kanielc

Reputation: 1322

Here's a nice generic function that will support any case class you create.

import org.apache.spark.sql.types.StructType
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.catalyst.ScalaReflection

// ScalaReflection needs only a TypeTag, so no Encoder is required
def schemaFor[T : TypeTag]: StructType =
  ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]

In your case, you'd call it like this:

case class TestCase(id: Long)
schemaFor[TestCase]

The spark-shell output looks like this:

scala> schemaFor[TestCase]
res11: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
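For what it's worth, this also handles nested case classes; a quick sketch (hypothetical classes, output approximately as the shell prints it):

scala> case class Address(city: String)
scala> case class Person(id: Long, address: Address)
scala> schemaFor[Person]
res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false), StructField(address,StructType(StructField(city,StringType,true)),true))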

Upvotes: 1

huon

Reputation: 102306

Instead of manually reproducing the logic for creating the implicit Encoder object that gets passed to toDF, one can use that directly (or, more precisely, implicitly in the same way as toDF):

// spark: SparkSession

import org.apache.spark.sql.Encoder
import spark.implicits._

implicitly[Encoder[MyCaseClass]].schema

Unfortunately, this actually suffers from the same problem as using org.apache.spark.sql.catalyst or Encoders as in the other answers: the Encoder trait is experimental.

How does this work? The toDF method on Seq comes from a DatasetHolder, which is created via the implicit localSeqToDatasetHolder that is imported via spark.implicits._. That function is defined as:

implicit def localSeqToDatasetHolder[T](s: Seq[T])(implicit arg0: Encoder[T]): DatasetHolder[T]

As you can see, it takes an implicit Encoder[T] argument, which, for a case class, can be computed via newProductEncoder (also imported via spark.implicits._). We can reproduce this implicit logic to get an Encoder for our case class via the convenience method scala.Predef.implicitly (in scope by default, because it's from Predef), which simply returns its requested implicit argument:

def implicitly[T](implicit e: T): T
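To make the resolution concrete, here's a sketch of what the compiler fills in (assuming a live SparkSession named spark and a hypothetical MyCaseClass):

import org.apache.spark.sql.Encoder
import spark.implicits._

case class MyCaseClass(id: Long)

// implicitly[...] resolves to the imported newProductEncoder
val enc: Encoder[MyCaseClass] = implicitly[Encoder[MyCaseClass]]
// equivalent, with the implicit spelled out by hand
val enc2: Encoder[MyCaseClass] = spark.implicits.newProductEncoder[MyCaseClass]

enc.schema  // StructType(StructField(id,LongType,false))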

Upvotes: 5

Kurt

Reputation: 771

I know this question is almost a year old, but I came across it and thought others who do too might want to know about this approach, which I just learned:

import org.apache.spark.sql.Encoders
val mySchema = Encoders.product[MyCaseClass].schema
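A quick spark-shell check (hypothetical case class; the printed form may vary slightly across Spark versions):

scala> import org.apache.spark.sql.Encoders
scala> case class MyCaseClass(id: Long, name: String)
scala> Encoders.product[MyCaseClass].schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false), StructField(name,StringType,true))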

Upvotes: 77

Art

Reputation: 1340

In case someone wants to do this for a custom Java bean:

ExpressionEncoder.javaBean(Event.class).schema().json()
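A self-contained Scala sketch of the same idea, with a hypothetical Event bean (ExpressionEncoder.javaBean expects getters, setters, and a no-arg constructor):

import scala.beans.BeanProperty
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// @BeanProperty generates the getId/setId-style accessors the encoder looks for
class Event {
  @BeanProperty var id: Long = _
  @BeanProperty var name: String = _
}

ExpressionEncoder.javaBean(classOf[Event]).schema

Note that Encoders.bean(classOf[Event]).schema does the same through the public API, without reaching into the catalyst package.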

Upvotes: 13

Tzach Zohar

Reputation: 37852

You can do it the same way SQLContext.createDataFrame does it:

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

val schema = ScalaReflection.schemaFor[TestCase].dataType.asInstanceOf[StructType]

Upvotes: 91
