Reputation: 23
I have about 3,500 CSVs that I convert to Parquet, partitioned by date (the data spans 7 days). I want every Parquet file to be about 1 GB. Currently I get far too many files (400-600 per day) with sizes varying between 64 and 128 MB. I can repartition (using repartition/coalesce) to x files per partition (day), but the file sizes still vary with how much data a day holds: day 1 may have 20 GB, so 10 files come out at 2 GB each, while day 2 has 10 GB, so each file is 1 GB. How can I set or code this so that every file in every partition is about 1 GB? I am using PySpark, and here is the code I use to write the Parquet files.
csv_reader_df.write.partitionBy("DateId").option("compression","snappy").parquet('hdfs://mycluster/home/sshuser/snappy_data.parquet')
Upvotes: 0
Views: 3042
Reputation: 733
From the Spark documentation:
Configuration of Parquet can be done using the setConf method on SparkSession or by running SET key=value commands using SQL.
So you can set parquet.block.size this way.
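A minimal PySpark sketch of both routes the documentation mentions (the session name spark is an assumption). Note that parquet.block.size is a Hadoop/Parquet key rather than a spark.sql.* key, so whether a value set on the session conf actually reaches the writer can depend on the Spark version; passing it as a writer option (as in the other answer) is a common alternative. It also controls the Parquet row group size, not the total file size, so on its own it will not force 1 GB files.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Route 1: set the key on the session configuration (setConf / conf.set).
spark.conf.set("parquet.block.size", 1024 * 1024 * 1024)  # target ~1 GB row groups

# Route 2: the equivalent SQL command.
spark.sql("SET parquet.block.size=1073741824")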
Upvotes: 0
Reputation: 340
The Parquet writer produces one file per Spark partition, so you have to repartition or coalesce to control the number of files.
// Row group size for the Parquet writer (32 MB here).
val PARQUET_BLOCK_SIZE: Int = 32 * 1024 * 1024
// Number of Spark partitions, and therefore output files, after coalesce.
val targetNbFiles: Int = 20

csv_reader_df.coalesce(targetNbFiles)
  .write
  .option("parquet.block.size", PARQUET_BLOCK_SIZE)
  .partitionBy("DateId")
  .option("compression", "snappy")
  .parquet("hdfs://mycluster/home/sshuser/snappy_data.parquet")
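To get close to the 1 GB-per-file goal from the question, here is a rough PySpark sketch that applies the same repartition idea per day: estimate each day's size, derive a file count from it, and write day by day. The per-day loop, the AVG_ROW_BYTES estimate, and mode("append") are assumptions for illustration, not part of this answer; compressed Parquet will come out smaller than the raw estimate, so tune the constants for your data.

from pyspark.sql import functions as F

TARGET_FILE_BYTES = 1024 * 1024 * 1024   # aim for ~1 GB per output file
AVG_ROW_BYTES = 200                      # assumed average row size on disk; tune for your data

# Per-day row counts, used to estimate each day's output size.
day_counts = {
    row["DateId"]: row["cnt"]
    for row in csv_reader_df.groupBy("DateId").agg(F.count("*").alias("cnt")).collect()
}

for date_id, cnt in day_counts.items():
    # Floor division plus max(1, ...) keeps at least one file per day.
    n_files = max(1, int(cnt * AVG_ROW_BYTES // TARGET_FILE_BYTES))
    (csv_reader_df
        .filter(F.col("DateId") == date_id)
        .repartition(n_files)                     # one Spark partition -> one output file
        .write
        .mode("append")
        .partitionBy("DateId")
        .option("compression", "snappy")
        .parquet("hdfs://mycluster/home/sshuser/snappy_data.parquet"))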
Upvotes: 2