Reputation: 20372
With respect to the following code:
spark.sql(sqlStatement).write.partitionBy("city", "dataset", "origin").mode(SaveMode.Overwrite).parquet(rootPath)
It deletes everything under the rootPath
before writing data to it. If the code is changed to:
spark.sql(sqlStatement).write.partitionBy("city", "dataset", "origin").mode(SaveMode.Append).parquet(rootPath)
then it does not delete anything. What we want is a mode that will not delete all the data under rootPath,
but will delete only the data under a specific city/dataset/origin partition
before writing to it. How can this be done?
Upvotes: 2
Views: 3523
Reputation: 37019
Have a look at the spark.sql.sources.partitionOverwriteMode="dynamic"
setting, which was introduced in Spark 2.3.0. With it, an Overwrite write replaces only the partitions that are present in the written data, leaving the rest of rootPath intact.
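For example (a minimal sketch, assuming Spark 2.3.0+ and the same sqlStatement and rootPath as in the question):
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

spark.sql(sqlStatement)
  .write
  .partitionBy("city", "dataset", "origin")
  .mode(SaveMode.Overwrite) // only the city/dataset/origin partitions being written are replaced
  .parquet(rootPath)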
Upvotes: 0
Reputation: 3619
Try the basePath option. Partition discovery will then only be pointed towards children of '/city/dataset/origin'.
According to the documentation:
Spark SQL’s partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678).
spark.sql(sqlStatement)
  .write
  .partitionBy("city", "dataset", "origin")
  .option("basePath", "/city/dataset/origin")
  .mode(SaveMode.Append)
  .parquet(rootPath)
Let me know if this doesn't work; I'll remove my answer.
Upvotes: 1