Reputation: 661
I am reading a parquet file like this using Java Spark
Dataset<MyData> myDataDS = sparkSession.read().parquet(myParquetFile)
    .as(Encoders.bean(MyData.class));
This works fine if the schema of myParquetFile matches the class MyData exactly. However, if I add a new field, e.g. myId, to the MyData class (even though its value is null), then I need to regenerate the parquet file; otherwise it throws an exception like
Caused by: org.apache.spark.sql.AnalysisException: No such struct field myId
Is there a way I can skip the null values to get past this error without regenerating the parquet file?
Upvotes: 2
Views: 20288
Reputation: 5078
When reading parquet, by default Spark uses the schema contained in the parquet files to read the data. Since, unlike the Avro format for instance, the schema is stored in the parquet files themselves, you must regenerate the parquet files if you want to change the schema.
However, instead of letting Spark infer the schema, you can provide a schema to Spark's DataFrameReader
by using the method .schema()
. In this case, Spark will ignore the schema defined in the parquet files and use the schema you provide.
So the solution is to pass the schema extracted from your target class to Spark's DataFrameReader
:
Dataset<MyData> myDataDS = sparkSession.read()
    .schema(Encoders.bean(MyData.class).schema())
    .parquet(myParquetFile)
    .as(Encoders.bean(MyData.class));
The AnalysisException
is not thrown and you get a dataset with a column "myId" set to null.
Upvotes: 4
Reputation: 6338
A brute-force approach to solve this:
// Read the parquet file as-is; its schema does not contain the new field myId
Dataset<Row> parquet = spark.read()
        .parquet(getClass()
                .getResource("/parquet/plain/part-00000-4ece3595-e410-4301-aefd-431cd1debf91-c000.snappy" +
                        ".parquet")
                .getPath());
parquet.show(false);
/**
 * +------+
 * |price |
 * +------+
 * |123.15|
 * +------+
 */
// For every field in the bean schema, select the existing column if present,
// otherwise substitute a typed null literal under the field's name
StructType schema = Encoders.bean(MyData.class).schema();
List<String> columns = Arrays.stream(parquet.columns()).collect(Collectors.toList());
List<Column> columnList = JavaConverters.asJavaCollectionConverter(schema).asJavaCollection().stream()
        .map(f -> columns.contains(f.name()) ? col(f.name()) : lit(null).cast(f.dataType()).as(f.name()))
        .collect(Collectors.toList());
Dataset<MyData> myDataDS = parquet
        .select(JavaConverters.asScalaBufferConverter(columnList).asScala())
        .as(Encoders.bean(MyData.class));
myDataDS.show(false);
myDataDS.printSchema();
/**
* +----+------+
* |myId|price |
* +----+------+
* |null|123.15|
* +----+------+
*
* root
* |-- myId: string (nullable = true)
* |-- price: decimal(5,2) (nullable = true)
*/
public class MyData {
    private double price;
    private String myId;

    public String getMyId() {
        return myId;
    }

    public void setMyId(String myId) {
        this.myId = myId;
    }

    public double getPrice() {
        return price;
    }

    public void setPrice(double price) {
        this.price = price;
    }
}
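The column-selection step above is the heart of this approach: for each field the bean schema expects, keep the existing column, or fall back to a null cast to the field's type. Stripped of the Spark API, the decision logic can be sketched in plain Java (hypothetical class and method names, not part of the answer's code); it builds SQL expression strings of the kind you could also pass to Spark's Dataset.selectExpr(...):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ColumnFill {
    // For each [name, type] pair the target schema expects, emit either the
    // column name (if the file already has it) or a typed-null expression.
    static List<String> selectExpressions(List<String> fileColumns,
                                          List<String[]> schemaFields) {
        return schemaFields.stream()
                .map(f -> fileColumns.contains(f[0])
                        ? f[0]
                        : "CAST(NULL AS " + f[1] + ") AS " + f[0])
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> fileColumns = Arrays.asList("price");
        List<String[]> schemaFields = Arrays.asList(
                new String[]{"myId", "STRING"},
                new String[]{"price", "DOUBLE"});
        System.out.println(selectExpressions(fileColumns, schemaFields));
        // [CAST(NULL AS STRING) AS myId, price]
    }
}
```

The answer's code does the same thing with Column objects (lit(null).cast(...)) instead of expression strings, which avoids string escaping issues for unusual column names.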
Upvotes: 1