Reputation: 842
I want to do some HTTP requests from inside a Spark job to a rate-limited API. In order to keep track of the number of concurrent requests in a non-distributed system (in Scala), something like an Akka-based throttler works (a minimal sketch is included below). However, there are issues with (de)serializing the actorSystem in a distributed Spark context. I could collect the dataframes to the Spark driver and handle the throttling there with an approach like the one above, but I would like to keep this distributed. How are such things typically handled?
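For reference, here is a minimal sketch of the kind of non-distributed throttling I mean, assuming Akka Streams (2.6+); callApi and the specific rates are placeholders, not my real client code:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.Source
import scala.concurrent.Future
import scala.concurrent.duration._

object LocalThrottle {
  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("throttle-demo")
    import system.dispatcher

    // Placeholder for the real HTTP call to the rate-limited API.
    def callApi(id: Int): Future[String] = Future(s"response for $id")

    Source(1 to 1000)
      .throttle(10, 1.second)              // at most 10 requests per second
      .mapAsync(parallelism = 4)(callApi)  // bounded number of in-flight requests
      .runForeach(println)
      .onComplete(_ => system.terminate())
  }
}
```

This works fine on a single JVM, but the ActorSystem cannot be shipped to Spark executors, which is exactly the problem.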
Upvotes: 7
Views: 2564
Reputation: 3990
You shouldn't try to synchronise requests across Spark executors/partitions; that is completely against Spark's concurrency model. Instead, for example, divide the global rate limit R by the number of executors times the cores per executor, and use mapPartitions to send requests from each partition within its R/(e*c) share of the limit.
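A minimal sketch of that idea, assuming a global limit R of 100 requests per second split across 4 executors with 2 cores each; callApi, the partition counts, and the Thread.sleep pacing are illustrative placeholders rather than a production throttle:

```scala
import org.apache.spark.sql.SparkSession

object RateLimitedCalls {
  // Stand-in for the real HTTP call made from each partition.
  def callApi(id: Long): String = s"response for $id"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rate-limited-calls").getOrCreate()
    import spark.implicits._

    val globalRatePerSec = 100                                              // R: global API limit
    val executors        = 4                                                // e (assumed)
    val coresPerExecutor = 2                                                // c (assumed)
    val partitionRate    = globalRatePerSec / (executors * coresPerExecutor) // R / (e * c)

    // One partition per concurrently running task, so each task gets its own share.
    val ids = spark.range(0, 10000).repartition(executors * coresPerExecutor)

    val responses = ids.mapPartitions { iter =>
      // Pace this partition to at most partitionRate requests per second.
      val delayMs = 1000L / math.max(partitionRate, 1)
      iter.map { id =>
        Thread.sleep(delayMs)   // crude fixed-delay throttle; a token bucket would be smoother
        callApi(id)
      }
    }

    responses.show(5)
    spark.stop()
  }
}
```

A per-partition token bucket (for example, a Guava RateLimiter created inside mapPartitions) would pace requests more evenly than a fixed sleep, but the key point is the R/(e*c) split: each partition stays within its own share, so no cross-executor coordination is needed.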
Upvotes: 9