Reputation: 483
Is there a way to set a new proxy IP (e.g. from a pool) according to the HTTP response status code? For example, start with an IP from a list and use it until it gets a 503 response (or another HTTP error code), then switch to the next one until that gets blocked, and so on, something like:
if http_status_code in [403, 503, ..., n]:
    proxy_ip = 'new ip'
    # then keep using it until it gets another error code
Any ideas?
Upvotes: 0
Views: 2733
Reputation: 21436
Scrapy has a downloader middleware, enabled by default, for handling proxies: HttpProxyMiddleware. It lets you supply a proxy
meta key to your Request
and use that proxy for that request.
There are a few ways of doing this.
The first, straightforward one is to use it directly in your spider code:
def parse(self, response):
    if response.status in range(400, 600):
        # dont_filter is needed because this url was already requested once
        return Request(response.url,
                       meta={'proxy': 'http://myproxy:8010'},
                       dont_filter=True)
Another, more elegant way is to use a custom downloader middleware, which handles this for multiple callbacks and keeps your spider code cleaner:
import logging

from scrapy import Request

from project.settings import PROXY_URL


class MyDM(object):
    def process_response(self, request, response, spider):
        if response.status in range(400, 600):
            logging.debug('retrying [{}]{} with proxy: {}'.format(
                response.status, response.url, PROXY_URL))
            return Request(response.url,
                           meta={'proxy': PROXY_URL},
                           dont_filter=True)
        return response
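Since the question asks about rotating through a pool rather than retrying with a single PROXY_URL, the same middleware idea can be extended to cycle through a list of proxies whenever a response comes back with an error status. This is a sketch, not the middleware from the answer above: the class name, pool contents, and rotation policy are all illustrative assumptions.

```python
import itertools
import logging

# Hypothetical proxy pool -- replace with your own proxy URLs.
PROXY_POOL = [
    'http://proxy1:8010',
    'http://proxy2:8010',
    'http://proxy3:8010',
]


class RotatingProxyMiddleware(object):
    """Sketch of a downloader middleware that switches to the next
    proxy from the pool whenever a 4xx/5xx response comes back."""

    def __init__(self):
        # Cycle endlessly over the pool; each blocked proxy advances it.
        self._proxies = itertools.cycle(PROXY_POOL)
        self.current_proxy = next(self._proxies)

    def process_request(self, request, spider):
        # Attach the currently active proxy to every outgoing request
        # (unless the request already carries its own proxy).
        request.meta.setdefault('proxy', self.current_proxy)

    def process_response(self, request, response, spider):
        if response.status in range(400, 600):
            # Current proxy looks blocked -- rotate and retry.
            self.current_proxy = next(self._proxies)
            logging.debug('retrying [%s]%s with proxy: %s',
                          response.status, response.url, self.current_proxy)
            retry = request.replace(dont_filter=True)
            retry.meta['proxy'] = self.current_proxy
            return retry
        return response
```

Note that a single shared `current_proxy` means concurrent requests all move to the new proxy together once one of them hits an error; whether that is the behavior you want depends on how aggressively the target site blocks.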
Note that by default Scrapy doesn't let through any response codes other than 200.
Redirect codes (3xx) are handled automatically
with the Redirect middleware,
and 4xx/5xx responses
raise errors
with the HttpError middleware. To handle responses other than 200 you need to either:
Specify it in the Request meta:
Request(url, meta={'handle_httpstatus_list': [404, 505]})
# or for all
Request(url, meta={'handle_httpstatus_all': True})
Set project- or spider-wide settings:
HTTPERROR_ALLOW_ALL = True # for all
HTTPERROR_ALLOWED_CODES = [404, 505] # for specific
as per http://doc.scrapy.org/en/latest/topics/spider-middleware.html#httperror-allowed-codes
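For the custom-middleware route, the class also has to be registered in the project settings before Scrapy will call it. A minimal sketch, assuming the class lives in a project.middlewares module (the module path and priority number are assumptions; adjust to your project layout):

```python
# settings.py (sketch)

# Register the custom middleware; 543 is an arbitrary priority that
# places it near Scrapy's built-in HttpProxyMiddleware (750).
DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.MyDM': 543,
}

# Let 4xx/5xx responses reach process_response instead of being
# swallowed by the HttpError middleware.
HTTPERROR_ALLOW_ALL = True
```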
Upvotes: 1