Reputation: 3793
I need to crawl a website at a fixed rate, let's say 8 pages per minute. I want the requests I make to the remote server to be uniformly distributed over the minute, so that they don't harm the server.
How can I maintain a uniform time difference, in seconds, between two consecutive requests? What is the best way to do this?
Upvotes: 0
Views: 145
Reputation: 1010
There are really two separate issues here. Let's tackle them one at a time:
FIRST QUESTION
I need to crawl a website at a rate, lets say, 8 pages per minute....so that it doesn't harm the server it is requesting to.
Paraphrase: I don't want to send more than 8 requests per minute, because I want to be nice to the remote server.
For this part, there is a related Stack Overflow question about rate limiting using PHP and cURL.
SECOND QUESTION
I wish the requests which I make to the remote server to be uniformly distributed over the minute....How can I maintain a uniform time difference in seconds between two consecutive requests
Paraphrase: I want to have the same amount of time in between each query.
This is a different question from the first one, and a trickier one. To do this, you will need to use a clock to record the time before and after each request, and keep a running average of how long a request takes, so that you can adjust how long you sleep and/or how often you call get(). You will also have to account for slow requests (what if an extremely laggy connection drags your average down so that you're only making 3 or 4 requests per minute?).
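As a rough illustration of the clock-based approach, here is a minimal Python sketch. It is not taken from the linked question: the name `paced` and its parameters are hypothetical, and it swaps the running average for a simpler idea, scheduling each request against a fixed deadline. That way the time a request takes is absorbed automatically, and a slow request simply restarts the schedule instead of causing a burst to catch up.

```python
import time

def paced(items, per_minute=8, clock=time.monotonic, sleep=time.sleep):
    """Yield items so that consecutive yields start 60 / per_minute seconds apart.

    `clock` and `sleep` are injectable only to make the sketch easy to test.
    """
    interval = 60.0 / per_minute          # 7.5 seconds for 8 per minute
    next_start = clock()
    for item in items:
        delay = next_start - clock()
        if delay > 0:
            # The previous request finished early: wait out the remainder.
            sleep(delay)
        else:
            # We are late (slow request): do not burst to catch up,
            # restart the schedule from the current time instead.
            next_start = clock()
        yield item                        # the caller performs the request here
        next_start += interval

# Hypothetical usage, where fetch() stands in for whatever makes the request:
#   for url in paced(urls, per_minute=8):
#       fetch(url)
```

With this shape, the spacing is measured from the start of one request to the start of the next, so a request that takes 2 seconds is followed by a 5.5 second sleep rather than a full 7.5 seconds.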
I personally don't think this is actually what you need to do "so that it doesn't harm the server".
Here's why: rate limits are usually stated as an upper bound per time slice. "8 requests per minute" means that all 8 may arrive at once within a given minute, as long as there are no more than 8 in that minute. The rate limiter does not expect them to be uniformly distributed over the minute. If the operators did want that, they'd have said "one request every 7.5 seconds".
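To make the "upper bound per time slice" reading concrete, here is a small Python sketch of a limiter that allows bursts but never more than 8 requests in any 60 second span. The class name `WindowLimiter` and its parameters are made up for illustration, and a sliding window is only one way to interpret such a limit; the server you are crawling may count differently.

```python
import time
from collections import deque

class WindowLimiter:
    """Block so that at most `max_requests` calls pass in any `window` seconds."""

    def __init__(self, max_requests=8, window=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._max = max_requests
        self._window = window
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()              # timestamps of recent requests

    def wait(self):
        now = self._clock()
        # Forget requests that have already left the window.
        while self._sent and now - self._sent[0] >= self._window:
            self._sent.popleft()
        if len(self._sent) >= self._max:
            # Window is full: sleep until the oldest request expires.
            self._sleep(self._window - (now - self._sent[0]))
            now = self._clock()
            self._sent.popleft()
        self._sent.append(now)

# Hypothetical usage, where fetch() stands in for whatever makes the request:
#   limiter = WindowLimiter(max_requests=8, window=60.0)
#   for url in urls:
#       limiter.wait()
#       fetch(url)
```

Here the first 8 requests go out immediately, and the ninth waits until the first one is 60 seconds old.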
Upvotes: 1