Makar Nikitin

Reputation: 329

Spark: schema is null

I am trying to figure out how to set the schema for a Row object. I copied the code from the docs https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.sql.types.StructType

 import org.apache.spark.sql._
 import org.apache.spark.sql.types._

 val innerStruct =
   StructType(
     StructField("f1", IntegerType, true) ::
     StructField("f2", LongType, false) ::
     StructField("f3", BooleanType, false) :: Nil)

 val struct = StructType(
   StructField("a", innerStruct, true) :: Nil)

 // Create a Row with the schema defined by struct
 val row = Row(Row(1, 2, true))

However, the schema is null:

println(row, row.schema)
// ([[1,2,true]],null)

So how do I set the schema?

Upvotes: 0

Views: 406

Answers (3)

Duelist

Reputation: 1572

First, you can create an RDD from the row and then a DataFrame with the schema specified, as follows:

val rdd = sparkSession.sparkContext.makeRDD(Seq(row))
val dataFrame = sparkSession.createDataFrame(rdd, struct)
println(dataFrame.schema)

This will print:

StructType(StructField(a,StructType(StructField(f1,IntegerType,true), StructField(f2,LongType,false), StructField(f3,BooleanType,false)),true))
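To see the schema come back on the rows themselves, here is a minimal end-to-end sketch (the local SparkSession setup and app name are illustrative):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("row-schema").getOrCreate()

// Same schema as in the question
val innerStruct = StructType(
  StructField("f1", IntegerType, true) ::
  StructField("f2", LongType, false) ::
  StructField("f3", BooleanType, false) :: Nil)
val struct = StructType(StructField("a", innerStruct, true) :: Nil)

// Note 2L: f2 is declared LongType, so the value must be a Long
val rdd = spark.sparkContext.makeRDD(Seq(Row(Row(1, 2L, true))))
val df = spark.createDataFrame(rdd, struct)

// Rows collected back out of the DataFrame carry the schema, so this is non-null
println(df.head.schema)
```

The key point is that a bare `Row` is just a container of values; the schema gets attached when the row passes through a DataFrame.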

Upvotes: 0

vaquar khan

Reputation: 11449

You can also attach a schema by converting the DataFrame to a typed Dataset using a case class and an Encoder:

    val df = Seq(
      (1, 2000L, true),
      (3, 4500L, false)
    ).toDF("f1", "f2", "f3")
    // df: org.apache.spark.sql.DataFrame = [f1: int, f2: bigint ... 1 more field]

    case class TestCase(f1: Int, f2: Long, f3: Boolean)
    // defined class TestCase

    val encoder = org.apache.spark.sql.Encoders.product[TestCase]
    // encoder: org.apache.spark.sql.Encoder[TestCase] = class[f1[0]: int, f2[0]: bigint, f3[0]: boolean]

    val dataset = df.as(encoder)
    // dataset: org.apache.spark.sql.Dataset[TestCase] = [f1: int, f2: bigint ... 1 more field]

    dataset.schema
    // res6: org.apache.spark.sql.types.StructType = StructType(StructField(f1,IntegerType,false), StructField(f2,LongType,false), StructField(f3,BooleanType,false))

    df.schema
    // res7: org.apache.spark.sql.types.StructType = StructType(StructField(f1,IntegerType,false), StructField(f2,LongType,false), StructField(f3,BooleanType,false))

    dataset.printSchema
    // root
    //  |-- f1: integer (nullable = false)
    //  |-- f2: long (nullable = false)
    //  |-- f3: boolean (nullable = false)

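Equivalently, you can skip the DataFrame round-trip and build the typed Dataset directly from case-class instances, so the schema is defined up front. A sketch (the session setup and class name are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Case class defines the schema: field names and types
case class TestCase(f1: Int, f2: Long, f3: Boolean)

val spark = SparkSession.builder().master("local[*]").appName("typed-ds").getOrCreate()
import spark.implicits._

// toDS() derives the Encoder implicitly from the case class
val dataset = Seq(TestCase(1, 2000L, true), TestCase(3, 4500L, false)).toDS()
dataset.printSchema()
```

This avoids constructing the `Encoder` explicitly; `spark.implicits._` derives it from the case class.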
Upvotes: 0

psyduck

Reputation: 113

You can try to use GenericRowWithSchema instead of a plain Row and initialize it with the schema. Note that it takes an `Array[Any]` of values (not a tuple), and that it lives in Spark's internal catalyst package:

import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
val row = new GenericRowWithSchema(Array(1, 2L, true), innerStruct)
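With a schema attached, `row.schema` is non-null and name-based field access via `getAs` also works. A self-contained sketch using the question's `innerStruct` (keep in mind GenericRowWithSchema is not a stable public API):

```scala
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._

// Same inner schema as in the question
val innerStruct = StructType(
  StructField("f1", IntegerType, true) ::
  StructField("f2", LongType, false) ::
  StructField("f3", BooleanType, false) :: Nil)

val row = new GenericRowWithSchema(Array(1, 2L, true), innerStruct)

println(row.schema)            // non-null StructType
println(row.getAs[Long]("f2")) // name lookup works because the schema is present
```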

Upvotes: 2
