Reputation: 1212
If one calls df.write.parquet(destination), is the DataFrame schema (i.e. the StructType information) saved along with the data?
If the Parquet files were generated by programs other than Spark, how does sqlContext.read.parquet figure out the schema of the resulting DataFrame?
Upvotes: 1
Views: 3951
Reputation: 28322
Parquet files automatically preserve the schema of the original data when saving, so it makes no difference whether it is Spark or another system that writes or reads the data.
If one or more columns are used to partition the data when saving, the data types of those columns are lost, since that information is encoded in the directory structure (paths like year=2020/month=5/) rather than in the Parquet files themselves. Spark can automatically infer the types of these columns when reading; currently only numeric types and strings are supported.
This automatic inference can be turned off by setting spark.sql.sources.partitionColumnTypeInference.enabled
to false, which makes the partition columns be read back as strings. For more information, see the Spark SQL programming guide.
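To make the partition-column behavior concrete, here is a minimal, self-contained sketch (plain Python, not Spark's actual implementation; the function names are made up for illustration) of how values embedded in a col=value directory layout can be parsed back, with and without type inference:

```python
# Sketch only: mimics how partition values stored in the directory layout
# (e.g. "year=2020/month=5/part-0000.parquet") are recovered as columns.
# In the path itself everything is a string; type inference tries to
# promote values to numeric types, as Spark does for partition columns.

def infer_partition_value(raw, infer_types=True):
    """Try numeric types first, fall back to string (hypothetical helper)."""
    if not infer_types:
        # With partitionColumnTypeInference.enabled=false, everything
        # stays a string.
        return raw
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw

def parse_partition_path(path, infer_types=True):
    """Extract {column: value} pairs from a partitioned file path."""
    columns = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            columns[key] = infer_partition_value(value, infer_types)
    return columns

print(parse_partition_path("year=2020/month=5/part-0000.parquet"))
# -> {'year': 2020, 'month': 5}
print(parse_partition_path("year=2020/month=5/part-0000.parquet",
                           infer_types=False))
# -> {'year': '2020', 'month': '5'}
```

This is why turning inference off makes the columns come back as strings: the directory name is the only place the value is stored, and a string is all that survives there.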
Upvotes: 1