XO39

Reputation: 483

How to detect HTTP response status code and set a proxy accordingly in scrapy?

Is there a way to set a new proxy IP (e.g. from a pool) according to the HTTP response status code? For example, start with an IP from a list and use it until it gets a 503 response (or another HTTP error code), then switch to the next one until that gets blocked, and so on. Something like:

if http_status_code in [403, 503, ..., n]:
    proxy_ip = 'new ip'
    # then keep using it until it gets another error code

Any ideas?

Upvotes: 0

Views: 2733

Answers (1)

Granitosaurus

Reputation: 21436

Scrapy has a downloader middleware that is enabled by default to handle proxies: HttpProxyMiddleware. It allows you to supply a proxy meta key to your Request, and that proxy will be used for the request.

There are a few ways of doing this.
The first, straightforward way is to use it directly in your spider code:

def parse(self, response):
    if response.status in range(400, 600):
        return Request(response.url,
                       meta={'proxy': 'http://myproxy:8010'},
                       dont_filter=True)  # ignore the dupe filter, since this url was already requested once

Another, more elegant way would be to use a custom downloader middleware, which handles this across multiple callbacks and keeps your spider code cleaner:

import logging

from scrapy import Request

from project.settings import PROXY_URL


class MyDM(object):
    def process_response(self, request, response, spider):
        if response.status in range(400, 600):
            logging.debug('retrying [{}]{} with proxy: {}'
                          .format(response.status, response.url, PROXY_URL))
            return Request(response.url,
                           meta={'proxy': PROXY_URL},
                           dont_filter=True)
        return response
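For the middleware route, it also has to be activated in your settings. A minimal sketch of the relevant settings.py fragment — the module path 'project.middlewares.MyDM', the priority 550, and the proxy address are all assumptions, not from the original answer:

```python
# settings.py -- activate the custom middleware
# (module path and priority value are assumed; adjust to your project)
PROXY_URL = 'http://myproxy:8010'

DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.MyDM': 550,
}
```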

Note that by default scrapy doesn't let any response codes other than 200 through to your callbacks. Redirect codes (300s) are handled automatically by the Redirect middleware, and 400s and 500s raise request errors via the HttpError middleware. To handle responses other than 200 you need to either:

Specify that in Request Meta:

Request(url, meta={'handle_httpstatus_list': [404,505]})
# or for all 
Request(url, meta={'handle_httpstatus_all': True})

Set project/spider-wide parameters:

HTTPERROR_ALLOW_ALL = True  # for all
HTTPERROR_ALLOWED_CODES = [404, 505]  # for specific

as per http://doc.scrapy.org/en/latest/topics/spider-middleware.html#httperror-allowed-codes
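As for rotating through a pool (as the question asks) rather than retrying with a single PROXY_URL, the switching logic itself can be kept framework-agnostic and dropped into the middleware above. A minimal sketch — the ProxyPool class, the pick_proxy helper, and the proxy addresses are hypothetical, not part of scrapy:

```python
from itertools import cycle


class ProxyPool(object):
    """Hands out one proxy at a time; advance() moves to the next
    proxy (wrapping around) when the current one starts failing."""

    def __init__(self, proxies):
        self._cycle = cycle(proxies)
        self.current = next(self._cycle)

    def advance(self):
        self.current = next(self._cycle)
        return self.current


def pick_proxy(status, pool):
    # switch proxies on error codes, otherwise keep the current one
    if status in (403, 503):
        pool.advance()
    return pool.current


# e.g. in the middleware, instead of a fixed PROXY_URL:
pool = ProxyPool(['http://proxy1:8010', 'http://proxy2:8010'])
```

In the middleware's process_response you would then call pick_proxy(response.status, pool) and pass the result as the proxy meta key.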

Upvotes: 1
