Reputation: 10941
Is it possible to limit the maximum number of concurrent tasks at the RDD level without changing the actual number of partitions? The use case is to avoid overwhelming a database with too many concurrent connections, but without reducing the number of partitions, since fewer partitions make each partition larger and eventually unmanageable.
Upvotes: 3
Views: 2934
Reputation: 945
I'm re-posting this as an "answer" because I think it may be the least-dirty hack that might get the behavior you want:
Use a mapPartitions(...) call, and at the beginning of the mapping function, do some kind of blocking check against globally visible state (a REST call, maybe?) that only allows some maximum number of checks to succeed at any given time. Since that delays the full RDD operation, you may need to increase the relevant timeouts to keep the job from failing while partitions wait.
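For concreteness, a minimal Scala sketch of that pattern might look like the following. The http://permit-service/acquire and /release URLs are a hypothetical coordination endpoint (any REST service or shared store that tracks how many partitions currently hold a permit would do), and the real database work goes where the comment indicates.

import org.apache.spark.rdd.RDD
import scala.io.Source

// Sketch only: block at the start of each partition until an external,
// globally visible "permit" service admits us, then release the permit when done.
def throttledPerPartition[T](rdd: RDD[T], maxWaitMs: Long = 10 * 60 * 1000L): RDD[Int] =
  rdd.mapPartitions { rows =>
    val deadline = System.currentTimeMillis() + maxWaitMs
    var admitted = false
    while (!admitted && System.currentTimeMillis() < deadline) {
      // Hypothetical endpoint that answers "ok" only while fewer than N permits are held.
      admitted = Source.fromURL("http://permit-service/acquire").mkString.trim == "ok"
      if (!admitted) Thread.sleep(1000)  // back off before re-checking
    }
    if (!admitted) throw new RuntimeException("Timed out waiting for a database permit")
    try {
      // ... do the real per-partition work here (e.g. write rows to the database) ...
      Iterator(rows.size)
    } finally {
      Source.fromURL("http://permit-service/release").mkString  // let the next partition in
    }
  }

Because tasks can sit in that waiting loop for a long time, this is also where raising timeouts such as spark.network.timeout can become necessary.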
Upvotes: 1
Reputation: 945
One approach might be to enable dynamic allocation, and set the maximum number of executors to your desired maximum parallelism (keep in mind that each executor can still run as many concurrent tasks as it has task slots, i.e. spark.executor.cores / spark.task.cpus).
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.maxExecutors <maximum>
Configuring dynamic allocation is described here:
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
https://spark.apache.org/docs/latest/configuration.html#scheduling
If you are trying to control one specific computation, you could experiment with programmatically controlling the number of executors:
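A hedged sketch of what that could look like is below. requestTotalExecutors is a @DeveloperApi method on SparkContext, it is only honoured by coarse-grained cluster managers such as YARN, and its signature may differ between Spark versions, so treat this as an experiment rather than a guaranteed API.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: shrink the executor pool around the database-heavy stage.
val sc = new SparkContext(new SparkConf().setAppName("throttled-db-write"))

// Ask the cluster manager to scale down to 4 executors before the write.
// Arguments: (numExecutors, localityAwareTasks, hostToLocalTaskCount)
sc.requestTotalExecutors(4, 0, Map.empty[String, Int])

// ... run the write stage with the reduced executor count ...

// ...then scale back up for the rest of the job.
sc.requestTotalExecutors(50, 0, Map.empty[String, Int])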
Upvotes: 0
Reputation: 2051
The primary significance of partitioning in Spark is to provide parallelism, and your requirement is to reduce that parallelism! But the requirement is genuine :)
What is the real problem with having fewer partitions? Is writing too much data at once creating the problem? If that is the case, you could break the per-partition writing down into smaller batches.
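As a rough Scala sketch of that idea (writeBatch is a hypothetical stand-in for your actual JDBC or client write code, and it must be serializable since it runs on the executors):

import org.apache.spark.rdd.RDD

// Sketch only: write each partition in small batches with a pause between them,
// instead of pushing the whole partition at the database at once.
def writeInBatches[T](rdd: RDD[T], batchSize: Int = 500, pauseMs: Long = 200L)
                     (writeBatch: Seq[T] => Unit): Unit =
  rdd.foreachPartition { rows =>
    rows.grouped(batchSize).foreach { batch =>
      writeBatch(batch.toSeq)   // e.g. one multi-row INSERT per batch
      Thread.sleep(pauseMs)     // throttle between batches to ease load on the DB
    }
  }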
Could you put the data in an intermediate queue and process it in a controlled manner?
Upvotes: 0