Reputation: 169
I have an application that does the following:
I would like to rate-limit the URLs sent to the server. Currently, to solve the problem, I sleep after every request is sent to the server. The sleep time is calculated as: (number of executors) * (number of cores available to each executor) / (intended RPS)
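To make the arithmetic concrete, here is a minimal sketch of that sleep-time formula in plain Python. The function and parameter names are hypothetical, chosen only for illustration, and the sketch assumes every core is busy running a task and that the request itself takes negligible time.

```python
def per_task_sleep_seconds(num_executors, cores_per_executor, target_rps):
    """Sleep each task should take after one request (hypothetical helper).

    Assumes num_executors * cores_per_executor tasks run concurrently,
    so the aggregate rate is concurrent_tasks / sleep_seconds.
    """
    concurrent_tasks = num_executors * cores_per_executor
    return concurrent_tasks / target_rps


# Example: 4 executors with 5 cores each, aiming for 10 requests per second.
sleep_s = per_task_sleep_seconds(4, 5, 10)
aggregate_rps = (4 * 5) / sleep_s
```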
This for some reason does not do any rate limiting, so I am looking for alternatives. From what I have found from this post, it seems Spark Streaming could be a good alternative if I could use the input Hive table as a streaming source and rate limit the reading.
I have read the documentation but cannot figure out whether a Hive table can be a streaming source. A file can be a streaming source, so I could always read the data from the Hive table, store it in a file, and then use that file as a streaming source, but I was wondering if it is possible to avoid this indirect route.
Upvotes: 2
Views: 1872
Reputation: 5125
You aren't really using the right tool for the job here. Yes, Spark reads from Hive, but so do a lot of other tools. Spark is made to do batch processing, whether it's streaming or ordinary processing. Rate control would require custom code.
You might look at other open source tools, like NiFi, that know how to work with Hive. Here's a good discussion on how to control flow rate with NiFi.
Or look at Nutch, which was made to scrape the internet into Hadoop.
If you wanted to abuse Spark to do this, you might be able to do something with foreachPartition: repartition the data into smaller chunks and reduce the number of cores/executors, so that the entire job takes longer to process... but really you're anti-optimizing at that point... again, not really a good look. Possible, but not really a good use of Spark.
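As a rough illustration of the custom code this would take, here is a minimal sketch of a per-partition throttle in plain Python. The name `send_throttled` and its parameters are hypothetical, `send` stands in for whatever function issues the request, and the clock and sleep functions are injectable only so the pacing can be checked without waiting.

```python
import time


def send_throttled(rows, send, max_per_second,
                   clock=time.monotonic, sleep=time.sleep):
    """Call send(row) for each row, starting at most max_per_second calls per second.

    Hypothetical helper: it paces a single partition only; it knows nothing
    about how many other partitions are running at the same time.
    """
    min_interval = 1.0 / max_per_second
    next_allowed = clock()
    sent = 0
    for row in rows:
        # Wait until the earliest time the next request may start.
        delay = next_allowed - clock()
        if delay > 0:
            sleep(delay)
        start = clock()
        send(row)
        sent += 1
        # Space request starts at least min_interval apart.
        next_allowed = start + min_interval
    return sent
```

With PySpark, a function like this could be handed to `rdd.foreachPartition`, with `max_per_second` set to the overall target rate divided by the number of tasks expected to run concurrently. That division is an assumption about the cluster, not something Spark enforces, which is part of why this remains a workaround.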
Upvotes: 1