Reputation: 23
I have about 3,500 CSVs that I convert to Parquet, partitioned by date (the data spans 7 days). I want every Parquet file to be about 1 GB. Currently I get far too many files (400-600 per day) with sizes varying between 64 and 128 MB. I can repartition (using repartition/coalesce) to x files per partition (day), but the file sizes still vary with how much data a day holds: day 1 may have 20 GB, so 10 files come out at 2 GB each, while day 2 has 10 GB, so each file is 1 GB. How can I set or code this so that every file in every partition is about 1 GB? I am using PySpark, and here is the code I use to write the Parquet files.
csv_reader_df.write.partitionBy("DateId").option("compression","snappy").parquet('hdfs://mycluster/home/sshuser/snappy_data.parquet')
Upvotes: 0
Views: 3042
Reputation: 733
From the Spark documentation:
Configuration of Parquet can be done using the setConf method on SparkSession or by running SET key=value commands using SQL.
So you can set parquet.block.size this way.
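A minimal PySpark sketch of both routes the documentation mentions (the session name spark is an assumption). Note that parquet.block.size is a Hadoop/Parquet key rather than a spark.sql.* key, so whether a value set on the session conf actually reaches the writer can depend on the Spark version; passing it as a writer option (as in the other answer) is a common alternative. It also controls the Parquet row group size, not the total file size, so on its own it will not force 1 GB files.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Route 1: set the key on the session configuration (setConf / conf.set).
spark.conf.set("parquet.block.size", 1024 * 1024 * 1024)  # target ~1 GB row groups

# Route 2: the equivalent SQL command.
spark.sql("SET parquet.block.size=1073741824")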
Upvotes: 0
Reputation: 340
The Parquet writer produces one file per Spark partition, so you have to repartition or coalesce to control the number of files.
// Row group size for the Parquet writer (32 MB here).
val PARQUET_BLOCK_SIZE: Int = 32 * 1024 * 1024
// Number of Spark partitions, and therefore output files, after coalesce.
val targetNbFiles: Int = 20

csv_reader_df.coalesce(targetNbFiles)
  .write
  .option("parquet.block.size", PARQUET_BLOCK_SIZE)
  .partitionBy("DateId")
  .option("compression", "snappy")
  .parquet("hdfs://mycluster/home/sshuser/snappy_data.parquet")
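To get close to the 1 GB-per-file goal from the question, here is a rough PySpark sketch that applies the same repartition idea per day: estimate each day's size, derive a file count from it, and write day by day. The per-day loop, the AVG_ROW_BYTES estimate, and mode("append") are assumptions for illustration, not part of this answer; compressed Parquet will come out smaller than the raw estimate, so tune the constants for your data.

from pyspark.sql import functions as F

TARGET_FILE_BYTES = 1024 * 1024 * 1024   # aim for ~1 GB per output file
AVG_ROW_BYTES = 200                      # assumed average row size on disk; tune for your data

# Per-day row counts, used to estimate each day's output size.
day_counts = {
    row["DateId"]: row["cnt"]
    for row in csv_reader_df.groupBy("DateId").agg(F.count("*").alias("cnt")).collect()
}

for date_id, cnt in day_counts.items():
    # Floor division plus max(1, ...) keeps at least one file per day.
    n_files = max(1, int(cnt * AVG_ROW_BYTES // TARGET_FILE_BYTES))
    (csv_reader_df
        .filter(F.col("DateId") == date_id)
        .repartition(n_files)                     # one Spark partition -> one output file
        .write
        .mode("append")
        .partitionBy("DateId")
        .option("compression", "snappy")
        .parquet("hdfs://mycluster/home/sshuser/snappy_data.parquet"))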
Upvotes: 2