Reputation: 55
I am trying to write a Parquet file to ADLS Gen2 using a Python notebook in Azure Databricks.
df.write.partitionBy("date").mode("overwrite").parquet(target_file_path)
When the files are written with the date partition, I see multiple smaller files being written for each partition. I want the output files to be around 1 GB each so that reading them later is faster. For example, if the total size is 4 GB, I want four 1 GB files instead of twenty 200 MB files, and if the total is less than 1 GB, just a single file.
One way is to estimate the size of the data and do a repartition or coalesce, but that consumes too much time. Is there any other way, or a setting that can be applied, so that files are written directly at roughly 1 GB instead of as many smaller files?
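For context, this is a minimal sketch of the repartition-based approach mentioned above. The values total_size_bytes and target_file_bytes are assumptions for illustration; in practice, estimating total_size_bytes is exactly the step that is slow.

import math

# Assumed estimate of the total output size (computing this is the expensive part)
total_size_bytes = 4 * 1024 * 1024 * 1024    # e.g. ~4 GB
target_file_bytes = 1024 * 1024 * 1024       # target ~1 GB per file

num_partitions = max(1, math.ceil(total_size_bytes / target_file_bytes))

# Each write task produces one file per date value it holds,
# so fewer tasks generally means fewer, larger files per date directory.
df.repartition(num_partitions) \
  .write.partitionBy("date") \
  .mode("overwrite") \
  .parquet(target_file_path)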
I tried the settings below, but they don't work either. What else can I try?
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ParquetFileSizeControl")
    .config("spark.sql.files.maxPartitionBytes", str(1024 * 1024 * 1024))  # 1 GB
    .config("parquet.block.size", str(1024 * 1024 * 1024))  # 1 GB (optional)
    .getOrCreate()
)

df.write.partitionBy("date").mode("overwrite").parquet(target_file_path)
Upvotes: 0
Views: 35