Reputation: 3667
Given Spark's coarse-grained execution model, when running a Spark job that contains a join or reduceByKey etc., is it a good idea to set spark.default.parallelism to a big number so that lots of threads can work on a single partition concurrently?
To my understanding this should be fine, right? But the downside is that this might make network I/O traffic heavy. The default is the total number of available cores.
Can anyone give some comments on this? Thanks in advance.
Upvotes: 0
Views: 1690
Reputation: 330413
so that lots of threads can work on a single partition concurrently
A partition is the smallest unit of concurrency in Spark, which means a single thread per partition. You can of course use parallel processing inside mapPartitions, but it is not part of standard Spark logic.
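To make the one-thread-per-partition point concrete, here is a minimal Scala sketch of opting into extra threads inside mapPartitions with a parallel collection. The function expensiveWork and all the sizes are made up, and .par as written assumes Scala 2.12 (on 2.13 it needs the separate scala-parallel-collections module). Spark still schedules a single task per partition; only the work inside that task is fanned out:

    // Sketch only: Spark assigns one task (one thread) per partition; the extra
    // threads below come from a Scala parallel collection inside our own code.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("threads-inside-a-partition").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for some CPU-heavy, thread-safe per-element work.
    def expensiveWork(x: Int): Int = {
      var acc = 0
      (1 to 1000).foreach(i => acc += (x * i) % 7)
      acc
    }

    val rdd = sc.parallelize(1 to 100000, numSlices = 8)

    val processed = rdd.mapPartitions { iter =>
      // Materialise the partition, fan the elements out to a parallel collection,
      // then hand a plain sequential iterator back to Spark.
      iter.toVector.par.map(expensiveWork).seq.iterator
    }

    processed.count()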
Higher parallelism means more partitions when the number of partitions is not specified otherwise. Usually this is a desired outcome, but it comes with a price: a growing cost of bookkeeping, less efficient aggregations, and, generally speaking, less data that can be processed locally without serialization/deserialization and network traffic. It can become a serious problem when the number of partitions is very high compared to the amount of data and the number of available cores (see Spark iteration time increasing exponentially when using join).
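As a rough illustration of how the setting feeds into shuffles when no explicit partition count is passed (this assumes the RDD API; the value 200 and the toy data are arbitrary):

    // The default partitioner falls back to spark.default.parallelism when no
    // numPartitions argument is given to a shuffle operation.
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf().set("spark.default.parallelism", "200")
    val spark = SparkSession.builder.config(conf).appName("default-parallelism-demo").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))

    // No partition count given: both shuffles produce 200 partitions.
    println(pairs.reduceByKey(_ + _).getNumPartitions)  // 200
    println(pairs.join(other).getNumPartitions)         // 200

    // An explicit argument overrides the default for just this operation.
    println(pairs.reduceByKey(_ + _, 16).getNumPartitions)  // 16

Note that DataFrame/Dataset shuffles are governed by spark.sql.shuffle.partitions rather than spark.default.parallelism.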
When it makes sense to increase parallelism: there is a lot of data relative to the number of available cores, or the job performs transformations that require a large number of partitions (joins, unions).

When it doesn't make sense to increase parallelism: the number of partitions is already high compared to the amount of data and the number of available cores, for example when the job is dominated by aggregations (groupBy, reduce, agg).

Generally speaking, I believe that spark.default.parallelism
is not a very useful tool, and it makes more sense to adjust parallelism on a case-by-case basis. If parallelism is too high, it can result in empty partitions during data loading and simple transformations, reduced performance, and suboptimal resource usage. If it is too low, it can lead to problems when you perform transformations which may require a large number of partitions (joins, unions).
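A sketch of what case-by-case tuning can look like, with the paths, the key extraction, and the numbers 400 and 16 invented purely for illustration: give the expensive wide join a high partition count, then shrink once the data gets small:

    // Per-operation tuning instead of one global setting.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("per-operation-parallelism").getOrCreate()
    val sc = spark.sparkContext

    val left  = sc.textFile("hdfs:///data/left").map(line => (line.take(8), line))
    val right = sc.textFile("hdfs:///data/right").map(line => (line.take(8), line))

    // The wide join over a lot of data gets a high partition count just for this step.
    val joined = left.join(right, 400)

    // After filtering the result is much smaller, so reduce the partition count to
    // avoid carrying around hundreds of tiny or empty partitions.
    // coalesce merges partitions without a full shuffle.
    val compact = joined
      .filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
      .coalesce(16)

    compact.saveAsTextFile("hdfs:///data/out")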
Upvotes: 3