Reputation: 8865
I'm overwriting a Delta table in Databricks and overwriting a Parquet file in Azure Data Lake using PySpark.
(
df.write
.format("delta")
.mode("overwrite")
.partitionBy("year_id","month_id","time_key")
.option("replaceWhere", "time_key = {}".format("20231020"))
.save("/mnt/test/new/")
)
Here the Delta table is overwritten as expected, but multiple Parquet files are created in the Azure Data Lake whenever I run it.
Can anyone suggest what I'm missing here?
Upvotes: 0
Views: 1054
Reputation: 3250
It seems like you are trying to overwrite a Parquet file in ADLS, but instead of the file being overwritten, multiple files are being created. This is happening because of the way Spark writes the output.
When you save a DataFrame as a Parquet file, PySpark creates multiple part files by default: the DataFrame is split into partitions, and each partition is written out as its own file. If you want the output to be a single file, you can call coalesce(1) on the DataFrame before saving it, which reduces it to one partition.
I have tried an example using coalesce(1) before saving the file:
df.coalesce(1).write.format("parquet").mode("overwrite").save(parquet_path)
As you can see, after using coalesce(1) there is only one part file created in ADLS.
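If a single output file is too restrictive for larger data, a hedged alternative is to repartition to a small fixed number of files instead (the count of 4 below is just an illustrative placeholder; parquet_path is the same path as above):
# Write roughly 4 part files instead of 1; choose the count based on data volume
df.repartition(4).write.format("parquet").mode("overwrite").save(parquet_path)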
Also, overwriting the Delta table:
df.write.format("delta").option("mergeSchema", "true").mode("overwrite").save(delta_path)
2nd Method:
When autoCompact is enabled, Databricks automatically runs an OPTIMIZE-style compaction after each write, merging many small files into fewer, larger ones.
It is possible that autoCompact is not enabled in your environment, which is why so many small files are being created. Keep in mind that autoCompact is a Delta Lake feature on Databricks, so it applies to the Delta write rather than to the plain Parquet output; you can try enabling it for the Delta write and see if that resolves the issue, and use coalesce or repartition to control the number of Parquet files.
The below is the code for enabling autoCompact and writing the data:
# Enable automatic compaction for Delta writes in this session
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

delta_file_path = f"abfss://<container Name>@<storageaccountname>.dfs.core.windows.net/delta_folder"
df.write.format("delta").mode("overwrite").save(delta_file_path)
Upvotes: 0