What's the best way to rate limit a Spark application?

Question

I have an application that does the following:

1. Reads URLs from a Hive table
2. Creates HTTP requests from those URLs, hits a server with them and parses the responses
3. Writes the parsed responses to another Hive table

I would like to rate limit the requests sent to the server. My current approach is to sleep after every request is sent. The sleep time is calculated as: (no. of executors) * (no. of cores available for each executor) / (intended RPS)
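
To make the calculation concrete, here is a minimal Python sketch of the sleep-based approach. The names `per_request_delay` and `throttled` are hypothetical helpers written for illustration, not part of Spark, and the sketch assumes one concurrent task per core and ignores the time the request itself takes.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def per_request_delay(num_executors: int, cores_per_executor: int,
                      target_rps: float) -> float:
    """Seconds each task sleeps after a request, per the formula above."""
    if target_rps <= 0:
        raise ValueError("target_rps must be positive")
    # Assumption: every core runs exactly one task, so this many
    # requests can be in flight at the same time.
    concurrent_tasks = num_executors * cores_per_executor
    return concurrent_tasks / target_rps


def throttled(items: Iterable[T], delay_s: float,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[T]:
    """Yield items one at a time, sleeping delay_s after each one is consumed."""
    for item in items:
        yield item
        sleep(delay_s)


# Example: 4 executors with 2 cores each, aiming for 16 requests per second.
delay = per_request_delay(num_executors=4, cores_per_executor=2, target_rps=16)
print(delay)  # 0.5
```

In the job itself, something like `throttled` would wrap the URLs of each partition (for example inside `mapPartitions`), so that every task waits `delay` seconds between requests. The overall rate matches the target only if the number of tasks actually running at once equals executors times cores.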

For some reason, the sleep-based approach does not do any rate limiting, so I am looking for alternatives. From what I found in this post, Spark Streaming could be a good alternative if I could use the input Hive table as a streaming source and rate limit the reading.

I have read the documentation but cannot figure out whether a Hive table can be a streaming source. A file can be a streaming source, so I could always read the data from the Hive table, store it in a file and then use that file as the streaming source, but I was wondering whether it is possible to avoid this indirect route.
