Georg Heiler

Reputation: 17676

Spark repartition is not uniform, still skewed

How can I force a (mostly) uniform distribution?

I want to perform something like:

df.repartition(5000) // scatter
  .transform(some_complex_function)
  .repartition(200)  // gather
  .write.parquet("myresult")

Indeed, 5000 tasks are executed after the repartition step. However, the input size per task varies from less than 1 MB to 16 MB.

The data is still skewed. How can I make sure it is no longer skewed and that cluster resources are used efficiently?

edit

I learned that this is due to the use of complex-typed columns, i.e. arrays. Note also that some_complex_function operates on this column, i.e. its complexity increases with the number of elements inside the array.

Is there a way to partition better for such a case?

Upvotes: 1

Views: 1809

Answers (1)

Raphael Roth

Reputation: 27373

repartition should distribute the records themselves uniformly; you can verify that using the techniques listed here: Apache Spark: Get number of records per partition
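To see why record counts come out even, here is a pure-Python sketch. This is an illustration only, not Spark: the round-robin assignment below is my assumption of a simplified model of what repartition(n) does, and `round_robin_partition` is a hypothetical helper. In real Spark you would instead inspect `df.groupBy(spark_partition_id()).count()` as the linked question describes.

```python
# Pure-Python illustration (a simplified model, not Spark itself) of why
# round-robin redistribution equalizes the number of records per partition.
def round_robin_partition(records, num_partitions):
    # Deal records out like cards: record i goes to partition i mod n.
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions

parts = round_robin_partition(list(range(10_000)), 8)
counts = [len(p) for p in parts]
print(counts)  # every partition holds exactly 10_000 / 8 = 1250 records
```

Uniform record counts are exactly what repartition promises; as the next paragraph notes, that says nothing about bytes per partition.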

If your records contain complex data structures, or strings of varying lengths, then the number of bytes per partition will not be equal. I asked for a solution to this problem here: How to (equally) partition array-data in spark dataframe

Upvotes: 1
