Reputation: 4928
Using PySpark 2.4.7
My goal is to write a PySpark DataFrame into a specific number of parquet files in AWS S3.
Say I want to write the DataFrame into 10 parquet files. This is my approach:
df.repartition(10).write.mode("append").partitionBy(partition_cols).saveAsTable(<db.table>)
This writes 10 parquet files for each partition bucket in S3. I want to write approximately 10 files in total across all partitions. How can I achieve this?
It could be achieved as Abdennacer Lachiheb mentioned in his answer. However, as per my comment, the data is imbalanced across the partition values, so simply dividing the total number of files by the number of partitions isn't optimal: I would end up with 5 files of ~10 MB each for the train partition and 5 files of ~1 MB each for the valid partition. I want the files to have similar sizes. I could achieve this by stratifying, but I'm wondering if there is a simpler way.
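For illustration, a rough sketch of that stratified approach (the partition column split, its values train/valid, and the table name db.table are assumed placeholders, not from real code): count the rows per partition value, allocate the ~10 files proportionally, and write each slice with its own repartition.

from pyspark.sql import functions as F

total_files = 10
# Hypothetical partition column "split" with values "train" and "valid".
counts = {row["split"]: row["cnt"]
          for row in df.groupBy("split").agg(F.count("*").alias("cnt")).collect()}
total_rows = sum(counts.values())

for value, cnt in counts.items():
    # Give each partition value a share of the 10 files proportional to its row count.
    n_files = max(1, int(round(total_files * cnt / total_rows)))
    (df.filter(F.col("split") == value)
       .repartition(n_files)
       .write.mode("append")
       .partitionBy("split")
       .saveAsTable("db.table"))  # placeholder table name, as in the question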
Upvotes: 0
Views: 1267
Reputation: 133
As I understood from the question and comments, you have data partitions like the following:
/date=yyyyMMdd/
|-/train/
|-/validation/
and for train and validation you want the number of files to be based on the data size of each. Let's say you have 100k (1 lakh) records, of which 70k are training and 30k are validation. In this case, if you set maxRecordsPerFile=10000,
you will have 7 files for train and 3 files for validation.
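For example, a minimal sketch of setting that option on the writer (maxRecordsPerFile is a DataFrameWriter option for file-based sources, available in Spark 2.4.7; the table name is the question's placeholder):

# Each output file holds at most 10,000 rows, so larger partitions
# naturally get proportionally more files.
(df.write
   .mode("append")
   .option("maxRecordsPerFile", 10000)
   .partitionBy(partition_cols)
   .saveAsTable("db.table"))  # placeholder table name

The same limit can also be set globally with spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000).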
Upvotes: 0
Reputation: 1857
You can use coalesce when writing the DataFrame out:
df.coalesce(partition_count).write.parquet(<storage account path>)
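For completeness, a hedged sketch with concrete values (the target count of 10 and the S3 path are assumptions). Note that coalesce only merges existing partitions, so it avoids a full shuffle, but the resulting files can be uneven in size; repartition shuffles the data and tends to produce more evenly sized files.

partition_count = 10  # target number of output files (assumed from the question)
df.coalesce(partition_count).write.mode("append").parquet("s3://my-bucket/my-table/")  # hypothetical S3 path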
Upvotes: 0
Reputation: 4888
We know that the total number of files is the number of files per partition multiplied by the number of partitions:
nb_all_files = nb_files_per_partitions * len(partition_cols)
so then nb_files_per_partitions = nb_all_files / len(partition_cols)
so in your case:
nb_files_per_partitions = 10 / len(partition_cols)
Final result:
df.repartition(10 // len(partition_cols)).write.mode("append").partitionBy(partition_cols).saveAsTable(<db.table>)
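As a worked example of that formula under the question's setup, assuming the division works out to two buckets, train and valid, as in the question's comment (repartition needs an integer, hence the floor division):

n_buckets = 2                        # hypothetical: train + valid
files_per_bucket = 10 // n_buckets   # = 5 files per bucket, 10 files in total
(df.repartition(files_per_bucket)
   .write.mode("append")
   .partitionBy(partition_cols)
   .saveAsTable("db.table"))  # placeholder table name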
Upvotes: 1