mohan111

Reputation: 8865

Trying to overwrite parquet file in an Azure Data Lake using PySpark in Databricks

I'm overwriting a Delta table in Databricks and also overwriting a Parquet file in Azure Data Lake using PySpark.

(
                     df.write
                       .format("delta")
                       .mode("overwrite")
                       .partitionBy("year_id","month_id","time_key")
                       .option("replaceWhere", "time_key = {}".format("20231020"))
                       .save("/mnt/test/new/")
)

Here the Delta table is overwritten as expected, but the Parquet write creates multiple files in Azure Data Lake whenever I run it.

Can anyone suggest what I'm missing here?

Upvotes: 0

Views: 1054

Answers (1)

It seems you are trying to overwrite a Parquet file in ADLS, but instead of the file being overwritten, multiple files are being created. This is most likely down to the way you are saving the DataFrame.

When you save a DataFrame as Parquet, PySpark writes multiple files by default: the DataFrame is split into partitions, and each partition is written out as its own part file by a separate task. If you want to overwrite an existing Parquet output with a single file, you can call coalesce(1) on the DataFrame before writing.
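
As a quick sanity check, the number of part files written generally matches the number of DataFrame partitions. A minimal sketch of how to inspect that (purely illustrative):

# Each DataFrame partition is written out as one part file by its own task,
# so the partition count predicts how many files will land in ADLS.
print(df.rdd.getNumPartitions())

# After coalescing to a single partition, only one part file is written.
print(df.coalesce(1).rdd.getNumPartitions())  # 1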

Here is an example using coalesce(1) before saving the file:

df.coalesce(1).write.format("parquet").mode("overwrite").save(parquet_path)

As you can see, after using coalesce there is only one part file created in ADLS.


Also, overwriting the Delta table:

df.write.format("delta").option("mergeSchema", "true").mode("overwrite").save(delta_path)

2nd Method:

When autoCompact is enabled, Spark automatically runs a lightweight OPTIMIZE after the write to re-organize the data, combining small files into fewer, larger ones where necessary.

It is possible that the autoCompact option is not enabled for the Parquet write, which is why it is creating multiple files. You can try adding the autoCompact option to the Parquet write as well and see if that resolves the issue.

Below is the code for writing the data to Parquet with the autoCompact option:

Parquet_file_path = f"abfss://<container Name>@<storageaccountname>.dfs.core.windows.net/parquet_folder"
df.write.mode("overwrite").option("autoCompact", "true").parquet(Parquet_file_path)

Upvotes: 0
