Govinda

Reputation: 73

S3 compaction: combining many small parquet files in an S3 folder into fewer, larger files using PySpark

I have a requirement to read the many small parquet files of a dataframe (an S3 folder) and rewrite them to the same location as one or more files, each between 128 MB and 900 MB in size.

I would appreciate suggestions or a solution for this use case.

Upvotes: 0

Views: 1195

Answers (1)

Shubham Jain

Reputation: 5536

You can do that by repartitioning the dataframe and saving it back to S3.

If you know the record count of the dataframe, you can control the number of output files like this:

import math

recordsPerFile = 100000  # tune this so each output file lands in your target size range
numOfFiles = max(1, math.ceil(df.count() / recordsPerFile))
df.repartition(numOfFiles).write.mode("overwrite").parquet("s3://bucket/path/")  # target S3 path
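
Since the question asks for output files in a specific size range (128 MB to 900 MB) rather than a fixed record count, here is a minimal sketch of a size-based variant. It assumes the source files sit under a bucket/prefix of your choosing (the bucket, prefix, and staging path below are placeholders), that boto3 is available to sum the input object sizes, and that your Spark environment can read s3:// (or s3a://) paths:

import math

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder locations: replace with your own bucket and prefix
bucket = "my-bucket"
prefix = "data/events/"
source_path = f"s3://{bucket}/{prefix}"
staging_path = f"s3://{bucket}/{prefix.rstrip('/')}_compacted/"

# Sum the sizes of the existing parquet objects under the prefix
s3 = boto3.client("s3")
total_bytes = 0
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]

# Aim for roughly 512 MB per output file, comfortably inside the
# requested 128 MB to 900 MB window; actual sizes will vary with
# parquet compression, so treat this as an estimate
target_file_bytes = 512 * 1024 * 1024
num_files = max(1, math.ceil(total_bytes / target_file_bytes))

df = spark.read.parquet(source_path)

# Write the compacted files to a staging prefix first: Spark cannot
# safely overwrite the same prefix it is still lazily reading from
df.repartition(num_files).write.mode("overwrite").parquet(staging_path)

Because Spark reads lazily, overwriting the exact prefix you are reading from can delete the input before it has been fully consumed, which is why the sketch writes to a staging prefix. Once the compacted files are verified, you can delete the original objects and move the new ones into place, then check a sample output file and adjust the target size if needed.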

Upvotes: 0
