Reputation: 1017
Is it possible to save an empty DataFrame with a known schema such that the schema is written to the file, even though it has 0 records?
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.StructType

def example(spark: SparkSession, path: String, schema: StructType) = {
  // Build a DataFrame that has the given schema but no rows
  val dataframe = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  val dataframeWriter = dataframe.write.mode(SaveMode.Overwrite).format("parquet")
  dataframeWriter.save(path)
  spark.read.load(path) // ERROR!! No files to read, so schema unknown
}
Upvotes: 5
Views: 10856
Reputation: 61
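I had a similar problem with Spark 2.1.0 and solved it by calling repartition before writing. An empty RDD has no partitions, so repartition(1) gives the write job one task to run, and that task writes a file carrying the schema.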
df.repartition(1).write.parquet("my/path")
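Applied to the example from the question, this might look like the sketch below. The helper name writeEmptyAndReadBack is hypothetical, and whether the plain write fails without repartition depends on the Spark version, so treat this as an illustration rather than a guaranteed fix.

```scala
import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: writes an empty DataFrame with the given schema to `path`
// as Parquet, then reads it back.
def writeEmptyAndReadBack(spark: SparkSession, path: String, schema: StructType): DataFrame = {
  val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  // emptyRDD has zero partitions; repartition(1) leaves exactly one (empty) partition,
  // so the write job has one task to run.
  empty.repartition(1).write.mode(SaveMode.Overwrite).parquet(path)
  spark.read.parquet(path)
}
```

Note that columns read back from Parquet are reported as nullable, so a schema with non-nullable fields may not compare equal after the round trip.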
Upvotes: 5
Reputation: 1017
This is the answer I received from Databricks Support:
This is a known issue in Spark, and a fix has already been made upstream; see the JIRA ticket https://issues.apache.org/jira/browse/SPARK-23271 . The behavior changes starting with Spark 2.4; for details, see this doc change: https://github.com/apache/spark/pull/20525/files#diff-d8aa7a37d17a1227cba38c99f9f22511R1808 . Until then, use one of the following workarounds:
- Save a DataFrame with at least one record so that its schema is preserved
- Save the schema in a JSON file and apply it when reading the data later
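The second workaround could be sketched as follows. The helper names (saveSchema, loadSchema, readWithSchema) are hypothetical, and the sketch assumes the schema file lives on the local filesystem; on HDFS or cloud storage you would need the Hadoop FileSystem API or similar instead of java.nio.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical helper: stores the schema's JSON representation in a local file.
def saveSchema(schema: StructType, schemaFile: String): Unit = {
  Files.write(Paths.get(schemaFile), schema.json.getBytes(StandardCharsets.UTF_8))
  ()
}

// Hypothetical helper: rebuilds the StructType from the JSON written by saveSchema.
def loadSchema(schemaFile: String): StructType = {
  val json = new String(Files.readAllBytes(Paths.get(schemaFile)), StandardCharsets.UTF_8)
  DataType.fromJson(json).asInstanceOf[StructType]
}

// Hypothetical helper: reads Parquet data with the stored schema supplied explicitly,
// so Spark does not have to infer it from the files.
def readWithSchema(spark: SparkSession, path: String, schemaFile: String): DataFrame =
  spark.read.schema(loadSchema(schemaFile)).parquet(path)
```

Because the schema is passed to the reader explicitly, the read should not depend on any data file being present to infer it from, though that is worth verifying on the Spark version in use.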
Upvotes: 6