Reputation: 329
I am trying to figure out how to set the schema for a Row object. I copied the code from the docs: https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.sql.types.StructType
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val innerStruct =
  StructType(
    StructField("f1", IntegerType, true) ::
    StructField("f2", LongType, false) ::
    StructField("f3", BooleanType, false) :: Nil)
val struct = StructType(
  StructField("a", innerStruct, true) :: Nil)
// Create a Row with the schema defined by struct
val row = Row(Row(1, 2, true))
However, the schema is null:
println(row, row.schema)
// ([[1,2,true]],null)
So how do I set the schema?
Upvotes: 0
Views: 406
Reputation: 1572
First, you can create an RDD from the row and then a DataFrame with the schema specified, as follows:
val rdd = sparkSession.sparkContext.makeRDD(Seq(row))
val dataFrame = sparkSession.createDataFrame(rdd, struct)
println(dataFrame.schema)
This will print:
StructType(StructField(a,StructType(StructField(f1,IntegerType,true), StructField(f2,LongType,false), StructField(f3,BooleanType,false)),true))
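Rows read back from that DataFrame carry the schema with them, so if the goal is a Row whose schema is populated, a quick way (reusing dataFrame from above) is:
val rowWithSchema = dataFrame.head()
println(rowWithSchema.schema)
// same StructType as printed above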
Upvotes: 0
Reputation: 11449
val df = Seq(
  (1, 2000L, true),
  (3, 4500L, false)
).toDF("f1", "f2", "f3")
df: org.apache.spark.sql.DataFrame = [f1: int, f2: bigint ... 1 more field]
case class TestCase(f1: Int, f2: Long,f3: Boolean)
defined class TestCase
val encoder = org.apache.spark.sql.Encoders.product[TestCase]
encoder: org.apache.spark.sql.Encoder[TestCase] = class[f1[0]: int, f2[0]: bigint, f3[0]: boolean]
val dataset = df.as(encoder)
dataset: org.apache.spark.sql.Dataset[TestCase] = [f1: int, f2: bigint ... 1 more field]
dataset.schema
res6: org.apache.spark.sql.types.StructType = StructType(StructField(f1,IntegerType,false), StructField(f2,LongType,false), StructField(f3,BooleanType,false))
df.schema
res7: org.apache.spark.sql.types.StructType = StructType(StructField(f1,IntegerType,false), StructField(f2,LongType,false), StructField(f3,BooleanType,false))
dataset.printSchema
root
|-- f1: integer (nullable = false)
|-- f2: long (nullable = false)
|-- f3: boolean (nullable = false)
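If all you need is the schema itself, the encoder exposes it directly as well; it should match the res6 output above:
encoder.schema
// StructType(StructField(f1,IntegerType,false), StructField(f2,LongType,false), StructField(f3,BooleanType,false))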
Upvotes: 0
Reputation: 113
You can try to use GenericRowWithSchema instead of just Row and initialize it with the schema:
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
// GenericRowWithSchema takes an Array[Any] of field values plus the StructType
val row = new GenericRowWithSchema(Array[Any](1, 2L, true), innerStruct)
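With the schema attached, row.schema is no longer null and fields can also be looked up by name. A quick check, reusing innerStruct from the question:
println(row.schema)
// StructType(StructField(f1,IntegerType,true), StructField(f2,LongType,false), StructField(f3,BooleanType,false))
println(row.getAs[Int]("f1"))
// 1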
Upvotes: 2