Reputation: 842
I want to do some HTTP requests from inside a Spark job to a rate-limited API. In order to keep track of the number of concurrent requests in a non-distributed system (in Scala), something like an Akka-based throttler works (a minimal sketch is included below). However, there are issues with (de)serializing the actorSystem in a distributed Spark context. I could collect the dataframes to the Spark driver and handle the throttling there with an approach like the one above, but I would like to keep this distributed. How are such things typically handled?
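For reference, here is a minimal sketch of the kind of non-distributed throttling I mean, assuming Akka Streams (2.6+); callApi and the specific rates are placeholders, not my real client code:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.Source
import scala.concurrent.Future
import scala.concurrent.duration._

object LocalThrottle {
  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("throttle-demo")
    import system.dispatcher

    // Placeholder for the real HTTP call to the rate-limited API.
    def callApi(id: Int): Future[String] = Future(s"response for $id")

    Source(1 to 1000)
      .throttle(10, 1.second)              // at most 10 requests per second
      .mapAsync(parallelism = 4)(callApi)  // bounded number of in-flight requests
      .runForeach(println)
      .onComplete(_ => system.terminate())
  }
}
```

This works fine on a single JVM, but the ActorSystem cannot be shipped to Spark executors, which is exactly the problem.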
Upvotes: 7
Views: 2564
Reputation: 3990
You shouldn't try to synchronise requests across Spark executors/partitions; that is completely against Spark's concurrency model. Instead, for example, divide the global rate limit R by the number of executors times the cores per executor, and use mapPartitions to send requests from each partition within its R/(e*c) share of the limit.
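A minimal sketch of that idea, assuming a global limit R of 100 requests per second split across 4 executors with 2 cores each; callApi, the partition counts, and the Thread.sleep pacing are illustrative placeholders rather than a production throttle:

```scala
import org.apache.spark.sql.SparkSession

object RateLimitedCalls {
  // Stand-in for the real HTTP call made from each partition.
  def callApi(id: Long): String = s"response for $id"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rate-limited-calls").getOrCreate()
    import spark.implicits._

    val globalRatePerSec = 100                                              // R: global API limit
    val executors        = 4                                                // e (assumed)
    val coresPerExecutor = 2                                                // c (assumed)
    val partitionRate    = globalRatePerSec / (executors * coresPerExecutor) // R / (e * c)

    // One partition per concurrently running task, so each task gets its own share.
    val ids = spark.range(0, 10000).repartition(executors * coresPerExecutor)

    val responses = ids.mapPartitions { iter =>
      // Pace this partition to at most partitionRate requests per second.
      val delayMs = 1000L / math.max(partitionRate, 1)
      iter.map { id =>
        Thread.sleep(delayMs)   // crude fixed-delay throttle; a token bucket would be smoother
        callApi(id)
      }
    }

    responses.show(5)
    spark.stop()
  }
}
```

A per-partition token bucket (for example, a Guava RateLimiter created inside mapPartitions) would pace requests more evenly than a fixed sleep, but the key point is the R/(e*c) split: each partition stays within its own share, so no cross-executor coordination is needed.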
Upvotes: 9