Reputation: 17676
How can I force a (mostly) uniform distribution?
I want to perform something like:
df.repartition(5000)                  // scatter
  .transform(some_complex_function)
  .repartition(200)                   // gather
  .write.parquet("myresult")
Indeed, 5000 tasks are executed after the repartition step. However, the input size per task varies from less than 1 MB to 16 MB, so the data is still skewed. How can I make sure it is no longer skewed and that cluster resources are used efficiently?
I learned that this is due to the use of complex-typed columns, i.e. arrays. Note also that some_complex_function operates on this column, so its complexity increases with the number of elements in the array.
Is there a way to partition better for such a case?
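To quantify the skew in terms of work rather than row counts, here is a minimal sketch that sums the array lengths per partition (the column name values is hypothetical; adjust to the actual schema):

import org.apache.spark.sql.functions.{col, size, spark_partition_id, sum}

// Approximate each task's work by the total number of array elements
// in its partition (assumes the array column is named "values").
df.repartition(5000)
  .withColumn("n_elements", size(col("values")))
  .groupBy(spark_partition_id().as("pid"))
  .agg(sum("n_elements").as("work"))
  .orderBy(col("work").desc)
  .show(20)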
Upvotes: 1
Views: 1809
Reputation: 27373
repartition should distribute records uniformly across partitions; you can verify that using the techniques listed here: Apache Spark: Get number of records per partition
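For example, a minimal sketch of that verification using spark_partition_id:

import org.apache.spark.sql.functions.{col, spark_partition_id}

// Count records per partition; a flat distribution here confirms that
// repartition balanced the row counts (though not necessarily the bytes).
df.repartition(5000)
  .groupBy(spark_partition_id().as("partition_id"))
  .count()
  .orderBy(col("count").desc)
  .show(20)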
If your records contain complex data structures or strings of varying lengths, the number of bytes per partition will not be equal. I asked for a solution to this problem here: How to (equally) partition array-data in spark dataframe
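Until there is a built-in solution, one workaround is to derive an explicit balancing key. This is a sketch under the assumption that downstream cost is roughly proportional to array length (the column name values is hypothetical): weight each row by its array size, bucket rows by cumulative weight, and range-partition on the bucket (repartitionByRange requires Spark 2.3+):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, floor, size, sum}

val numPartitions = 5000

// Weight each row by its array length (a proxy for downstream cost).
val weighted = df.withColumn("weight", size(col("values")))

// Total work, and the target work per partition.
val totalWork = weighted.agg(sum("weight")).first.getLong(0)
val perPartition = totalWork.toDouble / numPartitions

// Running total of the weights. Caveat: a global (unpartitioned) window
// funnels every row through a single task, so this is only practical
// for moderate row counts.
val bucketed = weighted
  .withColumn("cum", sum(col("weight")).over(
    Window.orderBy(col("weight"))
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)))
  .withColumn("bucket", floor(col("cum") / perPartition))

// Range partitioning keeps each bucket together (unlike hash
// partitioning, where distinct buckets can collide), so every
// partition ends up with roughly the same total weight.
val balanced = bucketed.repartitionByRange(numPartitions, col("bucket"))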
Upvotes: 1