Reputation: 55
I am trying to write a Parquet file to ADLS Gen2 using a Python notebook in Azure Databricks.
df.write.partitionBy("date").mode("overwrite").parquet(target_file_path)
When the files are written with the date partition, I see multiple smaller files being written for each partition. I want the output files to be around 1 GB each so that reading them later is faster. For example, if the total size is 4 GB, I want four 1 GB files instead of twenty 200 MB files, and if the total is less than 1 GB, just a single file.
One way is to estimate the size of the data and do a repartition or coalesce, but that consumes too much time. Is there any other way, or a setting that can be applied, so that files are written directly at roughly 1 GB instead of as many smaller files?
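For context, this is a minimal sketch of the repartition-based approach mentioned above. The values total_size_bytes and target_file_bytes are assumptions for illustration; in practice, estimating total_size_bytes is exactly the step that is slow.

import math

# Assumed estimate of the total output size (computing this is the expensive part)
total_size_bytes = 4 * 1024 * 1024 * 1024    # e.g. ~4 GB
target_file_bytes = 1024 * 1024 * 1024       # target ~1 GB per file

num_partitions = max(1, math.ceil(total_size_bytes / target_file_bytes))

# Each write task produces one file per date value it holds,
# so fewer tasks generally means fewer, larger files per date directory.
df.repartition(num_partitions) \
  .write.partitionBy("date") \
  .mode("overwrite") \
  .parquet(target_file_path)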
I tried the settings below, but they don't work either. What else can I try?
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ParquetFileSizeControl")
    .config("spark.sql.files.maxPartitionBytes", str(1024 * 1024 * 1024))  # 1 GB
    .config("parquet.block.size", str(1024 * 1024 * 1024))  # 1 GB (optional)
    .getOrCreate()
)

df.write.partitionBy("date").mode("overwrite").parquet(target_file_path)
Upvotes: 0
Views: 35