Reputation: 1344
I have been using an excellent answer to a question posted on SE here (Need to Know Partitioning Details in Dataframe Spark) to determine the number of partitions and the distribution of partitions across a dataframe.
Can someone help me expand on that answer to determine the partition size of a dataframe?
Thanks
Upvotes: 5
Views: 13457
Reputation: 1951
Tuning the partition size is inevitably linked to tuning the number of partitions. There are at least 3 factors to consider in this scope:
A "good" high level of parallelism is important, so you may want to have a big number of partitions, resulting in a small partition size.
However, there is an upper bound of the number due to the following 3rd point - distribution overhead. Nevertheless, it's still ranked priority #1, so let's say if you have to make a mistake, start with the side of high level of parallelism.
Generally, 2 to 4 tasks per core are recommended:
In general, we recommend 2-3 tasks per CPU core in your cluster.
We recommend using three to four times more partitions than there are cores in your cluster
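As a minimal Scala sketch of that rule of thumb (my addition, not from the quoted docs; the app name and Parquet path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-tuning").getOrCreate()
val df = spark.read.parquet("/path/to/data") // placeholder input

// defaultParallelism usually equals the total number of executor cores.
val cores = spark.sparkContext.defaultParallelism

// Aim for ~3 tasks per core, splitting the difference between the two quotes.
val repartitioned = df.repartition(cores * 3)
println(s"cores=$cores, partitions=${repartitioned.rdd.getNumPartitions}")
```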
2. If the partition size is very large (e.g. > 1 GB), you may hit issues such as long garbage collection pauses or out-of-memory errors, especially when there is a shuffle operation, as per the Spark docs:
Sometimes, you will get an OutOfMemoryError, not because your RDDs don’t fit in memory, but because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large...
Hence here is another pro of a large number of partitions (or, equivalently, a small partition size).
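For DataFrame workloads specifically, the number of post-shuffle partitions is governed by spark.sql.shuffle.partitions (default 200), so raising it shrinks each shuffle partition. A hedged sketch, where the value 400 and the column name are illustrative only:

```scala
// Raise the post-shuffle partition count so each shuffle partition is smaller.
spark.conf.set("spark.sql.shuffle.partitions", "400") // 400 is illustrative

// Subsequent wide operations (joins, groupBy) now produce 400 partitions.
val aggregated = df.groupBy("some_column").count() // "some_column" is hypothetical
```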
3. Distributed computing comes with overhead, so you can't go to the extreme either. If each task takes less than roughly 100 ms to execute, the application may suffer remarkable overhead from things like task scheduling and launching, serialization, and data movement, in which case you may lower the level of parallelism and increase the partition size a bit.
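If the Spark UI shows a flood of sub-100 ms tasks, one way to act on this point (a sketch, assuming you only need fewer partitions rather than a rebalance) is coalesce, which merges partitions without a full shuffle:

```scala
// coalesce reduces the partition count without a full shuffle,
// trading some parallelism for less per-task overhead.
val fewer = df.coalesce(64) // 64 is illustrative; derive it from your data size
```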
Take-away
Empirically, people usually try 100-1000 MB per partition, so why not start with that? And remember that the number may need to be re-tuned over time.
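To see where an existing DataFrame df actually stands against that range, here is a rough sketch; approxPartitionSizes is a hypothetical helper of mine, and a row's UTF-8 string length is only a crude proxy for its real size, but it is enough to spot skew:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: approximate bytes per partition of a DataFrame.
def approxPartitionSizes(df: DataFrame): Array[(Int, Long)] =
  df.rdd
    .mapPartitionsWithIndex { case (idx, rows) =>
      // Crude per-row size estimate via the row's string form.
      val bytes = rows.map(_.toString.getBytes("UTF-8").length.toLong).sum
      Iterator((idx, bytes))
    }
    .collect()
    .sortBy(_._1)

approxPartitionSizes(df).foreach { case (idx, bytes) =>
  println(f"partition $idx%4d ~ ${bytes / 1e6}%.1f MB")
}
```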
Upvotes: 8