Reputation: 95
Given the two case classes:
case class Response(
responseField: String
...
items: List[Item])
case class Item(
itemField: String
...)
I am creating a Response
dataset:
val dataset = spark.read.format("parquet")
.load(inputPath)
.as[Response]
.map(x => x)
The issue arises when itemField
is not present in any of the rows and spark will raise this error org.apache.spark.sql.AnalysisException: No such struct field itemField
. If itemField
was not nested I could handle it by doing dataset.withColumn("itemField", lit(""))
. Is it possible to do the same within the List
field?
Upvotes: 0
Views: 1590
Reputation: 1623
I assume the following:
Data was written with the following schema:
case class Item(itemField: String)
case class Response(responseField: String, items: List[Item])
Seq(Response("a", List()), Response("b", List())).toDF.write.parquet("/tmp/structTest")
Now schema changed to:
case class Item(itemField: String, newField: Int)
case class Response(responseField: String, items: List[Item])
spark.read.parquet("/tmp/structTest").as[Response].map(x => x) // Fails
For Spark 2.4 please see: Spark - How to add an element to an array of structs
For Spark 2.3 this should work:
val addNewField: (Array[String], Array[Int]) => Array[Item] = (itemFields, newFields) => itemFields.zip(newFields).map { case (i, n) => Item(i, n) }
val addNewFieldUdf = udf(addNewField)
spark.read.parquet("/tmp/structTest")
.withColumn("items", addNewFieldUdf(
col("items.itemField") as "itemField",
array(lit(1)) as "newField"
)).as[Response].map(x => x) // Works
Upvotes: 2