Reputation: 1212
If one calls df.write.parquet(destination), is the DataFrame schema (i.e. the StructType information) saved along with the data?
If the Parquet files were generated by programs other than Spark, how does sqlContext.read.parquet figure out the schema of the resulting DataFrame?
Upvotes: 1
Views: 3951
Reputation: 28322
Parquet files automatically preserve the schema of the original data when saving, so it makes no difference whether it is Spark or another system that writes or reads the data.
If one or more columns are used to partition the data when saving, the data types of those columns are lost, since that information is encoded in the directory structure (paths like year=2020/month=5/) rather than in the Parquet files themselves. Spark can automatically infer the types of these columns when reading; currently only numeric types and strings are supported.
This automatic inference can be turned off by setting spark.sql.sources.partitionColumnTypeInference.enabled
to false, which makes the partition columns be read back as strings. For more information, see the Spark SQL programming guide.
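To make the partition-column behavior concrete, here is a minimal, self-contained sketch (plain Python, not Spark's actual implementation; the function names are made up for illustration) of how values embedded in a col=value directory layout can be parsed back, with and without type inference:

```python
# Sketch only: mimics how partition values stored in the directory layout
# (e.g. "year=2020/month=5/part-0000.parquet") are recovered as columns.
# In the path itself everything is a string; type inference tries to
# promote values to numeric types, as Spark does for partition columns.

def infer_partition_value(raw, infer_types=True):
    """Try numeric types first, fall back to string (hypothetical helper)."""
    if not infer_types:
        # With partitionColumnTypeInference.enabled=false, everything
        # stays a string.
        return raw
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw

def parse_partition_path(path, infer_types=True):
    """Extract {column: value} pairs from a partitioned file path."""
    columns = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            columns[key] = infer_partition_value(value, infer_types)
    return columns

print(parse_partition_path("year=2020/month=5/part-0000.parquet"))
# -> {'year': 2020, 'month': 5}
print(parse_partition_path("year=2020/month=5/part-0000.parquet",
                           infer_types=False))
# -> {'year': '2020', 'month': '5'}
```

This is why turning inference off makes the columns come back as strings: the directory name is the only place the value is stored, and a string is all that survives there.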
Upvotes: 1