Reputation: 10941
Is it possible to limit the maximum number of concurrent tasks at the RDD level without changing the actual number of partitions? The use case is to avoid overwhelming a database with too many concurrent connections, but without reducing the number of partitions, since fewer partitions make each partition larger and eventually unmanageable.
Upvotes: 3
Views: 2934
Reputation: 945
I'm re-posting this as an "answer" because I think it may be the least-dirty hack that might get the behavior you want:
Use a mapPartitions(...) call, and at the beginning of the mapping function, do some kind of blocking check against globally visible state (a REST call, maybe?) that only allows some maximum number of checks to succeed at any given time. Since that delays the full RDD operation, you may need to increase the relevant timeouts to keep the job from failing while partitions wait.
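For concreteness, a minimal Scala sketch of that pattern might look like the following. The http://permit-service/acquire and /release URLs are a hypothetical coordination endpoint (any REST service or shared store that tracks how many partitions currently hold a permit would do), and the real database work goes where the comment indicates.

import org.apache.spark.rdd.RDD
import scala.io.Source

// Sketch only: block at the start of each partition until an external,
// globally visible "permit" service admits us, then release the permit when done.
def throttledPerPartition[T](rdd: RDD[T], maxWaitMs: Long = 10 * 60 * 1000L): RDD[Int] =
  rdd.mapPartitions { rows =>
    val deadline = System.currentTimeMillis() + maxWaitMs
    var admitted = false
    while (!admitted && System.currentTimeMillis() < deadline) {
      // Hypothetical endpoint that answers "ok" only while fewer than N permits are held.
      admitted = Source.fromURL("http://permit-service/acquire").mkString.trim == "ok"
      if (!admitted) Thread.sleep(1000)  // back off before re-checking
    }
    if (!admitted) throw new RuntimeException("Timed out waiting for a database permit")
    try {
      // ... do the real per-partition work here (e.g. write rows to the database) ...
      Iterator(rows.size)
    } finally {
      Source.fromURL("http://permit-service/release").mkString  // let the next partition in
    }
  }

Because tasks can sit in that waiting loop for a long time, this is also where raising timeouts such as spark.network.timeout can become necessary.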
Upvotes: 1
Reputation: 945
One approach might be to enable dynamic allocation, and set the maximum number of executors to your desired maximum parallelism (keep in mind that each executor can still run as many concurrent tasks as it has task slots, i.e. spark.executor.cores / spark.task.cpus).
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.maxExecutors <maximum>
Configuring dynamic allocation is described here:
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
https://spark.apache.org/docs/latest/configuration.html#scheduling
If you are trying to control one specific computation, you could experiment with programmatically controlling the number of executors:
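A hedged sketch of what that could look like is below. requestTotalExecutors is a @DeveloperApi method on SparkContext, it is only honoured by coarse-grained cluster managers such as YARN, and its signature may differ between Spark versions, so treat this as an experiment rather than a guaranteed API.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: shrink the executor pool around the database-heavy stage.
val sc = new SparkContext(new SparkConf().setAppName("throttled-db-write"))

// Ask the cluster manager to scale down to 4 executors before the write.
// Arguments: (numExecutors, localityAwareTasks, hostToLocalTaskCount)
sc.requestTotalExecutors(4, 0, Map.empty[String, Int])

// ... run the write stage with the reduced executor count ...

// ...then scale back up for the rest of the job.
sc.requestTotalExecutors(50, 0, Map.empty[String, Int])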
Upvotes: 0
Reputation: 2051
The primary significance of partitioning in Spark is to provide parallelism, and your requirement is to reduce that parallelism! But the requirement is genuine :)
What is the real problem with having fewer partitions? Is writing too much data at once creating the problem? If that is the case, you could break the per-partition writing down into smaller batches.
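As a rough Scala sketch of that idea (writeBatch is a hypothetical stand-in for your actual JDBC or client write code, and it must be serializable since it runs on the executors):

import org.apache.spark.rdd.RDD

// Sketch only: write each partition in small batches with a pause between them,
// instead of pushing the whole partition at the database at once.
def writeInBatches[T](rdd: RDD[T], batchSize: Int = 500, pauseMs: Long = 200L)
                     (writeBatch: Seq[T] => Unit): Unit =
  rdd.foreachPartition { rows =>
    rows.grouped(batchSize).foreach { batch =>
      writeBatch(batch.toSeq)   // e.g. one multi-row INSERT per batch
      Thread.sleep(pauseMs)     // throttle between batches to ease load on the DB
    }
  }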
Could you put the data in an intermediate queue and process it in a controlled manner?
Upvotes: 0