cbrown

Reputation: 141

Change number of output files using DataFrameWriter in Spark

I have a Dataset that I'm writing out to S3 using the DataFrameWriter. I'm writing Parquet and calling partitionBy on a column that has 256 distinct values. It works, but it takes some time to write the dataset out (and to read it back in other jobs). In debugging, I noticed that the writer outputs only 256 files, one per suffix, despite my repartition call specifying 256 partitions. Is there a way to increase the number of files output for each partitionBy value?

My code looks like:

// Hash-partition into 256 partitions keyed on the "suffix" column, then write
myDS = myDS.repartition(256, functions.col("suffix"));
myDS.write().partitionBy("suffix").parquet(String.format(this.outputPath, "parquet", this.date));

Upvotes: 1

Views: 1159

Answers (1)

cbrown

Reputation: 141

The issue with my code was that I specified a column in my repartition call. Hash-partitioning by "suffix" puts all rows with the same suffix value into a single partition, so the writer produces exactly one file per suffix. Simply removing the column from the repartition call fixed the issue: a plain repartition(256) distributes rows round-robin, so each suffix's rows are spread across many partitions, as shown below.
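The corrected version of the write from my question looks like:

// Round-robin repartition: rows for each suffix are spread across all 256 partitions,
// so the writer can emit up to 256 files per suffix value
myDS = myDS.repartition(256);
myDS.write().partitionBy("suffix").parquet(String.format(this.outputPath, "parquet", this.date));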

The number of output files per partitionBy value is directly tied to the number of partitions being written: each write task emits at most one file per partitionBy value present in its partition. Suppose you have 256 distinct partitionBy values. If you precede your writer with a repartition(5) call, each value can appear in at most 5 partitions, so you'll end up with a maximum of 5 output files per partitionBy value. The total number of output files would not exceed 1280, though it could be fewer if there is not much data for a given partitionBy value.
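A minimal self-contained sketch of that math (the class name and S3 paths here are illustrative, not from the question; it assumes the input has a "suffix" column with 256 distinct values):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RepartitionExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("RepartitionExample")
                .getOrCreate();

        // Assumed input with a "suffix" column holding 256 distinct values
        Dataset<Row> ds = spark.read().parquet("s3://my-bucket/input/");

        // 5 partitions; each write task emits at most one file per suffix it sees,
        // so each of the 256 suffix values lands in at most 5 files (<= 1280 total)
        ds.repartition(5)
          .write()
          .partitionBy("suffix")
          .parquet("s3://my-bucket/output/");

        spark.stop();
    }
}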

Upvotes: 2
