user8109580

Reputation:

Crawling web pages with limitations

I have a question about crawling data from web pages. Some sites have limitations on requests. How can I crawl in these kinds of situations?

Upvotes: 2

Views: 158

Answers (1)

mattjegan

Reputation: 2884

When crawling sites you might find that you get rate limited because you have made too many requests. For example, my site might block you for some number of seconds before allowing you to make another request. These limits vary from site to site and depend on how many requests you make and how often you make them.

One way to get around these limits is to wait a little between requests using your language's sleep method. In Python this is time.sleep(10).
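As a minimal sketch (assuming the `requests` library and a made-up list of URLs), that might look like:

    import time

    import requests  # assumed HTTP client; any library works the same way

    # Hypothetical list of pages to crawl
    urls = ["https://example.com/page/%d" % i for i in range(1, 6)]

    for url in urls:
        response = requests.get(url)
        print(url, response.status_code)
        time.sleep(10)  # pause between requests to stay under the site's rate limit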

If you still get blocked, you can try to adapt to the ban time by using increasing retry periods. For example, you get blocked on some request, so wait 5 seconds and try again (and get blocked), wait 10 seconds and try again (and get blocked), wait 20 seconds and try again (and get blocked), wait 40 seconds, and so on until you either reach a limit where you want to give up or the server allows the request to succeed.
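A rough sketch of that retry loop (assuming the `requests` library and that the site signals a block with HTTP 429; real sites may use other status codes or a Retry-After header):

    import time

    import requests  # assumed HTTP client

    def fetch_with_backoff(url, max_wait=60):
        """Retry a request, doubling the wait after each blocked attempt."""
        wait = 5  # start with a 5-second wait
        while True:
            response = requests.get(url)
            if response.status_code != 429:  # 429 Too Many Requests = rate limited
                return response
            if wait > max_wait:
                raise RuntimeError("Gave up on %s after backing off to %ds" % (url, wait))
            time.sleep(wait)
            wait *= 2  # 5 -> 10 -> 20 -> 40 -> ...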

Upvotes: 2
