Reputation: 73
I have a requirement to read multiple smaller Parquet files from an S3 folder into a dataframe and re-write them to the same location as one or more files, each between 128 MB and 900 MB in size.
Any suggestions or solutions for this use case would be appreciated.
Upvotes: 0
Views: 1195
Reputation: 5536
You can do that by repartitioning the dataframe before saving it to S3.
If you know the record count of the dataframe, you can derive the number of output files from a target records-per-file value:
recordsPerFile = 100000
numOfFiles = max(1, df.count() // recordsPerFile)  # integer division; write at least one file
df.repartition(numOfFiles).write.mode("overwrite").parquet("s3://bucket/path/")  # placeholder path
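Since the requirement is a size range (128 MB to 900 MB) rather than a fixed record count, another option is to estimate the partition count from the total size of the existing objects. Below is a minimal PySpark sketch of that idea, assuming boto3 is available; the bucket, prefix, and target size are placeholders, not part of the original question:

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bucket, prefix = "my-bucket", "my-prefix/"  # hypothetical S3 location
target_bytes = 512 * 1024 * 1024            # aim mid-range, between 128 MB and 900 MB

# Sum the sizes of the existing Parquet objects under the prefix
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
total_bytes = sum(
    obj["Size"]
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
    for obj in page.get("Contents", [])
)

df = spark.read.parquet(f"s3://{bucket}/{prefix}")
num_files = max(1, total_bytes // target_bytes)

# Write to a temporary prefix first: Spark cannot safely overwrite the
# same path it is lazily reading from. Move/rename the output afterwards.
df.repartition(num_files).write.mode("overwrite").parquet(f"s3://{bucket}/{prefix.rstrip('/')}_tmp/")

Note that Parquet files compress, so the on-disk size of the output will typically differ from the raw input size; you may need to adjust target_bytes after a test run to land inside the 128 MB to 900 MB window.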
Upvotes: 0