wei zhang

Reputation: 11

How to crawl web pages fast when the number of connections is limited

I wrote a web crawler to collect product information from www.amazon.com using urllib2, but it seems that Amazon limits each IP to one connection.

When I start more than one thread to crawl simultaneously, it raises HTTP Error 503: Service Temporarily Unavailable. I want to start more threads so the crawl finishes faster, so how can I fix this error?
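The crawler is roughly shaped like the simplified sketch below (the URLs are placeholders and the real parsing is omitted):

import threading
import urllib2

# Placeholder product URLs; the real crawler builds these from search results.
URLS = [
    "http://www.amazon.com/dp/PRODUCT_1",
    "http://www.amazon.com/dp/PRODUCT_2",
]

def fetch(url):
    try:
        page = urllib2.urlopen(url).read()
        # ... parse product information from `page` here ...
    except urllib2.HTTPError as e:
        # With more than one thread running at once, this prints
        # "HTTP Error 503: Service Temporarily Unavailable".
        print(e)

threads = [threading.Thread(target=fetch, args=(u,)) for u in URLS]
for t in threads:
    t.start()
for t in threads:
    t.join()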

Upvotes: 1

Views: 218

Answers (3)

lovesh

Reputation: 5401

Use the Python requests module to make connections through proxy IPs. The code will look like this:

import requests

# Replace the placeholders below with real proxy addresses.
proxies = {
  "http": "<an HTTP proxy IP>",
  "https": "<an HTTPS proxy IP>"
}
response = requests.get("http://your_url.com", proxies=proxies)

You should be able to get HTTP and HTTPS proxy IPs from any public proxy list; see the requests documentation on proxies for more help.
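If you want to run several workers, a minimal sketch is to rotate through a pool of proxies so each request goes out from a different IP (the addresses below are placeholders; substitute proxies you actually have access to):

import requests
from itertools import cycle

# Placeholder proxy addresses -- replace with real HTTP/HTTPS proxies.
PROXY_POOL = cycle([
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:3128"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:3128"},
])

def fetch(url):
    # Each call uses the next proxy in the rotation.
    return requests.get(url, proxies=next(PROXY_POOL), timeout=10)

response = fetch("http://www.amazon.com/dp/PRODUCT_1")  # placeholder URL
print(response.status_code)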

Upvotes: 0

Sven

Reputation: 70863

You should probably switch to using the Amazon API for product queries.
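For example, with the bottlenose wrapper around the Product Advertising API, a product lookup looks roughly like the sketch below (you need to register for your own access key, secret key, and associate tag; the values here are placeholders):

import bottlenose

# Placeholder credentials -- obtained by registering for the
# Amazon Product Advertising API.
amazon = bottlenose.Amazon("ACCESS_KEY", "SECRET_KEY", "ASSOCIATE_TAG")

# ItemLookup returns the raw XML response for the given ASIN;
# parse it with an XML library of your choice.
xml = amazon.ItemLookup(ItemId="ASIN_OF_THE_PRODUCT",
                        ResponseGroup="ItemAttributes")
print(xml)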

Upvotes: 0

rmunn

Reputation: 36688

Short version: you can't, and it would be a bad idea to even try.

Upvotes: 1
