Reputation: 4928
Using PySpark 2.4.7
My goal is to write a PySpark DataFrame into a specific number of parquet files in AWS S3.
Say I want to write the DataFrame into 10 parquet files. This is my approach:
df.repartition(10).write.mode("append").partitionBy(partition_cols).saveAsTable(<db.table>)
This writes 10 parquet files for each partition bucket in S3. I want to write approximately 10 files in total across all partitions. How can I achieve this?
It could be achieved as Abdennacer Lachiheb mentioned in his answer. However, as per my comment, the data is imbalanced across the partition values, so simply dividing the total number of files by the number of partitions isn't optimal: I would end up with 5 files of ~10 MB each for the train partition and 5 files of ~1 MB each for the valid partition. I want the files to have similar sizes. I could achieve this by stratifying, but I'm wondering if there is a simpler way.
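For illustration, a rough sketch of that stratified approach (the partition column split, its values train/valid, and the table name db.table are assumed placeholders, not from real code): count the rows per partition value, allocate the ~10 files proportionally, and write each slice with its own repartition.

from pyspark.sql import functions as F

total_files = 10
# Hypothetical partition column "split" with values "train" and "valid".
counts = {row["split"]: row["cnt"]
          for row in df.groupBy("split").agg(F.count("*").alias("cnt")).collect()}
total_rows = sum(counts.values())

for value, cnt in counts.items():
    # Give each partition value a share of the 10 files proportional to its row count.
    n_files = max(1, int(round(total_files * cnt / total_rows)))
    (df.filter(F.col("split") == value)
       .repartition(n_files)
       .write.mode("append")
       .partitionBy("split")
       .saveAsTable("db.table"))  # placeholder table name, as in the question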
Upvotes: 0
Views: 1267
Reputation: 133
As I understood from the question and comments, you have data partitions like the following:
/date=yyyyMMdd/
|-/train/
|-/validation/
and for train and validation you want the number of files to be based on the data size of each. Let's say you have 100k (1 lakh) records, of which 70k are training and 30k are validation. In this case, if you set maxRecordsPerFile=10000,
you will have 7 files for train and 3 files for validation.
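For example, a minimal sketch of setting that option on the writer (maxRecordsPerFile is a DataFrameWriter option for file-based sources, available in Spark 2.4.7; the table name is the question's placeholder):

# Each output file holds at most 10,000 rows, so larger partitions
# naturally get proportionally more files.
(df.write
   .mode("append")
   .option("maxRecordsPerFile", 10000)
   .partitionBy(partition_cols)
   .saveAsTable("db.table"))  # placeholder table name

The same limit can also be set globally with spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000).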
Upvotes: 0
Reputation: 1857
You can use coalesce when writing the DataFrame out:
df.coalesce(partition_count).write.parquet(<storage account path>)
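For completeness, a hedged sketch with concrete values (the target count of 10 and the S3 path are assumptions). Note that coalesce only merges existing partitions, so it avoids a full shuffle, but the resulting files can be uneven in size; repartition shuffles the data and tends to produce more evenly sized files.

partition_count = 10  # target number of output files (assumed from the question)
df.coalesce(partition_count).write.mode("append").parquet("s3://my-bucket/my-table/")  # hypothetical S3 path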
Upvotes: 0
Reputation: 4888
We know that the total number of files is the number of files per partition multiplied by the number of partitions:
nb_all_files = nb_files_per_partitions * len(partition_cols)
so then nb_files_per_partitions = nb_all_files / len(partition_cols)
so in your case:
nb_files_per_partitions = 10 / len(partition_cols)
Final result:
df.repartition(10 // len(partition_cols)).write.mode("append").partitionBy(partition_cols).saveAsTable(<db.table>)
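As a worked example of that formula under the question's setup, assuming the division works out to two buckets, train and valid, as in the question's comment (repartition needs an integer, hence the floor division):

n_buckets = 2                        # hypothetical: train + valid
files_per_bucket = 10 // n_buckets   # = 5 files per bucket, 10 files in total
(df.repartition(files_per_bucket)
   .write.mode("append")
   .partitionBy(partition_cols)
   .saveAsTable("db.table"))  # placeholder table name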
Upvotes: 1