Reputation: 141
I have a Dataset that I'm writing out to S3 using the DataFrameWriter. I'm using Parquet and also doing a partitionBy call on a column that has 256 distinct values. It works well but takes some time to write the dataset out (and to read it back into other jobs). While debugging, I noticed that the writer only outputs 256 files, one per suffix, despite my repartition call specifying 256 partitions. Is there a way to increase the number of files output for each partitionBy value?
My code looks like:
myDS = myDS.repartition(256, functions.col("suffix"));
myDS.write().partitionBy("suffix").parquet(String.format(this.outputPath, "parquet", this.date));
Upvotes: 1
Views: 1159
Reputation: 141
The issue with my code was that I specified a column in my repartition call. Simply removing the column argument, leaving repartition(256), fixed the issue: repartitioning by the same column you later pass to partitionBy hashes every row with a given suffix into a single partition, so at most one file gets written per suffix.
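For reference, a sketch of the corrected version of my code above, with the column argument dropped (outputPath and date are the same fields as in the question):

```java
// Round-robin repartition: no column argument, so rows with the same
// "suffix" value are spread across all 256 partitions instead of being
// hashed into a single one.
myDS = myDS.repartition(256);
myDS.write().partitionBy("suffix").parquet(String.format(this.outputPath, "parquet", this.date));
```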
The number of output files per partitionBy value is directly related to the number of partitions at write time. Suppose you have 256 distinct partitionBy values. If you precede your writer with a repartition(5) call, you'll end up with a maximum of 5 output files per partitionBy value. The total number of output files would not exceed 1280 (though it could be less if there is not much data for a given partitionBy value).
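The mechanics can be sketched without Spark at all. Below is a minimal plain-Java model (the class, helper, and data are illustrative, not Spark APIs): hash-partitioning on the partitionBy column sends every row with the same value to exactly one partition, while round-robin repartitioning spreads each value across up to numPartitions partitions, and each (partition, suffix) pair becomes one output file.

```java
import java.util.*;

public class PartitionSketch {
    // Files written per suffix = number of distinct partitions that hold
    // at least one row of that suffix.
    static Map<String, Set<Integer>> filesPerSuffix(
            List<String> rows, int numPartitions, boolean hashBySuffix) {
        Map<String, Set<Integer>> out = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            String suffix = rows.get(i);
            int partition = hashBySuffix
                    ? Math.floorMod(suffix.hashCode(), numPartitions) // like repartition(n, col("suffix"))
                    : i % numPartitions;                              // like repartition(n), round-robin
            out.computeIfAbsent(suffix, k -> new HashSet<>()).add(partition);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) rows.add("s" + (i % 256)); // 256 distinct suffixes

        // Hash-partitioning on the column: one file per suffix, no matter
        // how many partitions were requested.
        int maxWithColumn = filesPerSuffix(rows, 256, true)
                .values().stream().mapToInt(Set::size).max().getAsInt();
        // Round-robin: up to numPartitions files per suffix.
        int maxRoundRobin = filesPerSuffix(rows, 5, false)
                .values().stream().mapToInt(Set::size).max().getAsInt();

        System.out.println("max files per suffix, repartition(256, col): " + maxWithColumn);
        System.out.println("max files per suffix, repartition(5): " + maxRoundRobin);
    }
}
```

This is why repartition(256, col("suffix")) capped the output at 256 files: every suffix collapsed into one partition, regardless of the partition count requested.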
Upvotes: 2