Reputation: 251
I am trying to scrape using the Scrapy framework. Some requests are redirected, but the callback function set in start_requests is not called for these redirected URL requests, though it works fine for the non-redirected ones.
I have the following code in the start_requests function:
for user in users:
    yield scrapy.Request(url=userBaseUrl + str(user['userId']), cookies=cookies, headers=headers, dont_filter=True, callback=self.parse_p)
But this self.parse_p is called only for the non-302 requests.
Upvotes: 4
Views: 9054
Reputation: 9284
By default, Scrapy follows 302 redirects through its RedirectMiddleware, but the number of redirects a single request may go through is capped by the REDIRECT_MAX_TIMES setting (20 by default); once the cap is exceeded, the request is dropped and your callback is never called.
In your spider you can make use of the custom_settings attribute:
custom_settings: A dictionary of settings that will be overridden from the project-wide configuration when running this spider. It must be defined as a class attribute since the settings are updated before instantiation.
Set the maximum number of times a request can be redirected as follows:
class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    custom_settings = {'REDIRECT_MAX_TIMES': 333}

    def start_requests(self):
        # Your code here
I set 333 as an example limit.
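If you would rather raise the limit for the whole project instead of a single spider, a minimal sketch (assuming a standard Scrapy project layout) is to put the same setting in settings.py:
# settings.py -- project-wide equivalent of the custom_settings above
# (REDIRECT_MAX_TIMES defaults to 20 in Scrapy)
REDIRECT_MAX_TIMES = 333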
I hope this helps.
Upvotes: 0
Reputation: 2204
I guess you get a callback for the final page (after the redirect). Redirects are taken care of by the RedirectMiddleware. You could disable it, but then you would have to do all the redirects manually. If you want to selectively disable redirects for a few types of Requests, you can do it like this:
request = scrapy.Request(url, meta={'dont_redirect': True}, callback=self.manual_handle_of_redirects)
I'm not sure that the intermediate Requests/Responses are very interesting, though. That's also what RedirectMiddleware assumes, so it does the redirects automatically and saves the intermediate URLs (the only interesting thing) in:
response.request.meta.get('redirect_urls')
You have a few options!
Example spider:
import scrapy

class DimSpider(scrapy.Spider):
    name = "dim"

    start_urls = (
        'http://example.com/',
    )

    def parse(self, response):
        yield scrapy.Request(url="http://example.com/redirect302.php", dont_filter=True, callback=self.parse_p)

    def parse_p(self, response):
        print(response.request.meta.get('redirect_urls'))
        print("done!")
Example output...
DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
DEBUG: Redirecting (302) to <GET http://myredirect.com> from <GET http://example.com/redirect302.php>
DEBUG: Crawled (200) <GET http://myredirect.com/> (referer: http://example.com/redirect302.php)
['http://example.com/redirect302.php']
done!
If you really want to scrape the 302 pages, you have to explicitly allow it. For example here, I allow 302 and set dont_redirect to True:
handle_httpstatus_list = [302]

def parse(self, response):
    r = scrapy.Request(url="http://example.com/redirect302.php", dont_filter=True, callback=self.parse_p)
    r.meta['dont_redirect'] = True
    yield r
The end result is:
DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
DEBUG: Crawled (302) <GET http://example.com/redirect302.php> (referer: http://www.example.com/)
None
done!
This spider manually follows 302 URLs:
import scrapy

class DimSpider(scrapy.Spider):
    name = "dim"

    handle_httpstatus_list = [302]

    def start_requests(self):
        yield scrapy.Request("http://page_with_or_without_redirect.html",
                             callback=self.parse200_or_302, meta={'dont_redirect': True})

    def parse200_or_302(self, response):
        print("I'm on: %s with status %d" % (response.url, response.status))
        if 'location' in response.headers:
            print("redirecting")
            # header values are bytes, so decode before building the next Request
            return [scrapy.Request(response.headers['Location'].decode(),
                                   callback=self.parse200_or_302, meta={'dont_redirect': True})]
Be careful: don't omit setting handle_httpstatus_list = [302], otherwise you will get "HTTP status code is not handled or not allowed".
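As an alternative to the class-level attribute, Scrapy also honors a handle_httpstatus_list key in the request meta, so you can allow the 302 only for the requests that need it. A minimal sketch (the URL and callback name are placeholders from the example above):
# Per-request alternative to the class-level handle_httpstatus_list attribute:
# only this request is allowed to deliver a 302 response to its callback.
yield scrapy.Request("http://page_with_or_without_redirect.html",
                     callback=self.parse200_or_302,
                     meta={'dont_redirect': True,
                           'handle_httpstatus_list': [302]})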
Upvotes: 7