Rui

Reputation: 3667

Spark on YARN: spark.default.parallelism, from a coarse-grained point of view, bigger or smaller?

Given the coarse-grained nature of Spark, when running a Spark job containing a join or reduceByKey etc., is it a good idea to set spark.default.parallelism to a large number so that lots of threads can work on a single partition concurrently?

To my understanding this should be fine, right? But the downside is that it might increase network I/O traffic. The default is the total number of available cores.

Can anyone comment on this? Thanks in advance.

Upvotes: 0

Views: 1690

Answers (1)

zero323

Reputation: 330413

so that lots of threads can work on a single partition concurrently

A partition is the smallest unit of concurrency in Spark. It means a single thread per partition. You can of course use parallel processing inside mapPartitions, but it is not part of the standard Spark logic.
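For illustration, a minimal Scala sketch of that non-standard, do-it-yourself parallelism inside mapPartitions could look like the following; it assumes an existing SparkContext sc, and expensiveCall is just a made-up placeholder for per-record work:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Hypothetical placeholder for some per-record work.
    def expensiveCall(x: Int): Int = { Thread.sleep(10); x * 2 }

    val rdd = sc.parallelize(1 to 1000, numSlices = 4)

    // Spark still runs one thread per partition; the extra threads below
    // come from the futures we create ourselves, not from Spark.
    val result = rdd.mapPartitions { iter =>
      val futures = iter.map(x => Future(expensiveCall(x))).toList
      futures.map(f => Await.result(f, Duration.Inf)).iterator
    }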

Higher parallelism means more partitions when the number of partitions is not specified otherwise. Usually that is the desired outcome, but it comes at a price: growing bookkeeping costs, less efficient aggregations, and generally speaking less data that can be processed locally without serialization/deserialization and network traffic. It can become a serious problem when the number of partitions is very high compared to the amount of data and the number of available cores (see Spark iteration time increasing exponentially when using join).
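As a rough illustration of how the setting feeds into partition counts (a sketch assuming a plain spark-shell session with SparkContext sc): when no explicit partition count is passed to a shuffle such as reduceByKey, Spark falls back to spark.default.parallelism (or, if it is not set, to the largest upstream partition count).

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // No explicit numPartitions: the shuffle output falls back to the
    // default parallelism (or the upstream partition count if unset).
    val byDefault = pairs.reduceByKey(_ + _)
    println(byDefault.getNumPartitions)

    // Explicit numPartitions: overrides the default for this shuffle only.
    val explicit = pairs.reduceByKey(_ + _, 4)
    println(explicit.getNumPartitions)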

When it makes sense to increase parallelism:

  • you have a large amount of data and a lot of spare resources (the recommended number of partitions is twice the number of available cores).
  • you want to reduce the amount of memory required to process a single partition.
  • you perform computationally intensive tasks.

When it doesn't make sense to increase parallelism:

  • parallelism >> number of available cores.
  • parallelism is high compared to the amount of data and you want to process more than one record at a time (groupBy, reduce, agg).

Generally speaking, I believe that spark.default.parallelism is not a very useful tool and it makes more sense to adjust parallelism on a case-by-case basis. If parallelism is too high, it can result in empty partitions when loading data or applying simple transformations, and in reduced performance / suboptimal resource usage. If it is too low, it can lead to problems when you perform transformations which may require a large number of partitions (joins, unions).
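A small sketch of that per-operation tuning (illustrative numbers, assuming an existing SparkContext sc; left and right are just stand-in datasets):

    val left  = sc.parallelize(Seq((1, "x"), (2, "y")))
    val right = sc.parallelize(Seq((1, 1.0), (2, 2.0)))

    // A wide transformation that benefits from more partitions than the
    // default: raise parallelism just for this join.
    val joined = left.join(right, 200)

    // Downstream steps left with many near-empty partitions can be
    // compacted again without a full shuffle.
    val compacted = joined.coalesce(8)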

Upvotes: 3
