morpheus

Reputation: 20372

Spark: How to overwrite data in partitions but not the root folder while saving to disk?

With respect to the following code:

spark.sql(sqlStatement).write.partitionBy("city", "dataset", "origin").mode(SaveMode.Overwrite).parquet(rootPath)

It deletes everything under the rootPath before writing data to it. If the code is changed to:

spark.sql(sqlStatement).write.partitionBy("city", "dataset", "origin").mode(SaveMode.Append).parquet(rootPath)

then it does not delete anything. What we want is a mode that keeps the data under rootPath but deletes the data under a given city/dataset/origin partition before writing to it. How can this be done?

Upvotes: 2

Views: 3523

Answers (2)

Michael Spector

Reputation: 37019

Have a look at the spark.sql.sources.partitionOverwriteMode="dynamic" setting, which was introduced in Spark 2.3.0. With it enabled, SaveMode.Overwrite replaces only the partitions present in the incoming data instead of truncating everything under rootPath.
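A minimal PySpark sketch of this approach (assumes Spark ≥ 2.3.0 and an existing spark session; sqlStatement and rootPath are placeholders from the question):

```python
# Enable dynamic partition overwrite: only the partitions that appear in
# the incoming DataFrame are replaced; all other partitions under
# rootPath are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

spark.sql(sqlStatement) \
    .write \
    .partitionBy("city", "dataset", "origin") \
    .mode("overwrite") \
    .parquet(rootPath)
```

Note that "dynamic" only changes the behavior of overwrite mode; with the default "static" setting, the same write would still delete everything under rootPath first.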

Upvotes: 0

Pushkr

Reputation: 3619

Try the basePath option. Partition discovery will then be pointed only at children of '/city/dataset/origin'.

according to documentation -

Spark SQL’s partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678).

spark.sql(sqlStatement) \
    .write.partitionBy("city", "dataset", "origin") \
    .option("basePath", "/city/dataset/origin") \
    .mode("append") \
    .parquet(rootPath)

Let me know if this doesn't work and I'll remove my answer.

Upvotes: 1
