Reputation: 36307
I am using Scrapy 1.1 to scrape a website. The site requires periodic re-login; I can tell when this is needed because a 302 redirection occurs whenever login is required. Based on http://sangaline.com/post/advanced-web-scraping-tutorial/, I have subclassed the RedirectMiddleware, making the Location HTTP header available in the spider under:
request.meta['redirect_urls']
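For reference, the subclass is along these lines (a simplified sketch only; the class name and the exact bookkeeping are illustrative, the real code follows the tutorial above):
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class LoginAwareRedirectMiddleware(RedirectMiddleware):
    # Simplified sketch: record the redirect target so spider callbacks
    # can detect that a re-login is needed.
    def process_response(self, request, response, spider):
        if response.status in (301, 302) and 'Location' in response.headers:
            request.meta.setdefault('redirect_urls', []).append(
                response.headers['Location'])
        return super(LoginAwareRedirectMiddleware, self).process_response(
            request, response, spider)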
My problem is that after logging in, I have set up a function to loop through 100 pages to scrape. Let's say after 15 pages I see that I have to log back in (based on the contents of request.meta['redirect_urls']). My code looks like:
def test1(self, response):
    ...
    for row in empties:  # 100 records
        d = object_as_dict(row)
        # AA
        yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup,
                      meta={'d': d}, dont_filter=True)
def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        print str(response.meta['redirect_urls'])
        # BB
    d = response.meta['d']
So as you can see, I get 'notified' of the need to re-login in parse_lookup at BB, but need to feed this information back to cancel the loop creating requests in test1 (at AA). How can I make the information in parse_lookup available in the prior callback function?
Upvotes: 8
Views: 676
Reputation: 852
Why not use a DownloaderMiddleware?
You could write a DownloaderMiddleware like so:
Edit: I have edited the original code to address a second problem the OP had in the comments.
from scrapy.http import Request


class CustomMiddleware(object):

    def process_response(self, request, response, spider):
        if 'redirect_urls' in response.meta:
            # assuming your spider has a method for handling the login
            original_url = response.meta["redirect_urls"][0]
            return Request(url="login_url",
                           callback=spider.login,
                           meta={"original_url": original_url})
        return response
So you "intercept" the response before it goes to the parse_lookup and relogin/fix what is wrong and yield new requests...
As Tomáš Linhart said, the requests are asynchronous, so I don't know whether you could run into problems by re-logging in several times in a row, as multiple requests might be redirected at the same time.
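One way to mitigate that could be to keep a flag on the spider so only the first detected redirect triggers a login request. This is only a sketch, not tested code: relogin_pending is a made-up attribute, and it assumes your spider.login callback resets it once the login succeeds:
from scrapy.http import Request


class CustomMiddleware(object):

    def process_response(self, request, response, spider):
        if 'redirect_urls' in response.meta:
            original_url = response.meta['redirect_urls'][0]
            if getattr(spider, 'relogin_pending', False):
                # A login request is already in flight: re-queue the original
                # request instead of logging in a second time. Strip the stale
                # redirect bookkeeping so the retried response isn't treated
                # as another redirect.
                meta = dict(request.meta)
                meta.pop('redirect_urls', None)
                return request.replace(url=original_url, meta=meta,
                                       dont_filter=True)
            spider.relogin_pending = True  # spider.login should reset this
            return Request(url="login_url",
                           callback=spider.login,
                           meta={"original_url": original_url},
                           dont_filter=True)
        return response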
Remember to add the middleware to your settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 542,
    'myproject.middlewares.CustomMiddleware': 543,
}
Upvotes: 5
Reputation: 1548
I think it would be better not to fire all 100 requests at once; instead, try to "serialize" the requests. For example, you could carry all your empties in the request's meta and pop them off as necessary, or keep the empties as a field of your spider.
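A rough illustration of the meta approach (a sketch only; empties, object_as_dict, myurl and self.headers are taken from the question, and the 'pending' key is just an illustrative name):
def test1(self, response):
    pending = [object_as_dict(row) for row in empties]  # 100 records
    d = pending.pop(0)
    yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup,
                  meta={'d': d, 'pending': pending}, dont_filter=True)

def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        # A re-login is needed; handle it here (or let the middleware do it)
        # before carrying on.
        ...
    d = response.meta['d']
    # ... scrape the current record using d ...
    pending = response.meta['pending']
    if pending:
        next_d = pending.pop(0)
        yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup,
                      meta={'d': next_d, 'pending': pending}, dont_filter=True)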
Another alternative would be to use the scrapy-inline-requests package to accomplish what you want, but you should probably extend your middleware to perform the login.
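If you go the scrapy-inline-requests route, usage is roughly as below (a sketch; check the package's README for the exact API, and note that empties, object_as_dict, myurl and login_url are placeholders from the question):
from inline_requests import inline_requests
from scrapy import Request, Spider


class MySpider(Spider):
    name = 'my_spider'

    @inline_requests
    def parse(self, response):
        for row in empties:
            d = object_as_dict(row)
            lookup_response = yield Request(myurl, headers=self.headers,
                                            dont_filter=True)
            if 'redirect_urls' in lookup_response.meta:
                # We got bounced to the login page: log back in (e.g. via a
                # FormRequest), then retry the same record before moving on.
                yield Request(login_url, dont_filter=True)
                lookup_response = yield Request(myurl, headers=self.headers,
                                                dont_filter=True)
            # ... extract what you need from lookup_response using d ...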
Upvotes: 1
Reputation: 1364
Don't iterate over the 100 items and create requests for all of them. Instead, create a request for the first item only, process it in your callback function, yield the item, and only once that's done create the request for the second item and yield it. With this approach, you can check for the Location header in your callback and either make the request for the next item, or log in and repeat the current item's request.
For example:
def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        # It's a redirect
        yield Request(url=your_login_url, callback=self.parse_login_response,
                      meta={'current_item_url': response.request.url})
    else:
        # It's a normal response
        item = YourItem()
        ...  # Extract your item fields from the response
        yield item
        next_item_url = ...  # Extract the next page URL from the response
        yield Request(url=next_item_url, callback=self.parse_lookup)
This assumes that you can get the next item URL from the current item's page; otherwise, just put the list of URLs in the first request's meta dict and pass it along.
Upvotes: 2
Reputation: 10220
You can't achieve what you want because Scrapy uses asynchronous processing.
In theory, you could use the approach partially suggested in the comment by @Paulo Scardine, i.e. raise an exception in parse_lookup. For it to be useful, you would then have to write a spider middleware and handle this exception in its process_spider_exception method to log back in and retry the failed requests.
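Very roughly, that could look like the sketch below. LoginNeeded and the spider's login_request helper are made-up names, and whether process_spider_exception is allowed to return new requests depends on your Scrapy version, so check the docs before relying on this:
# In the spider module:
class LoginNeeded(Exception):
    pass

def parse_lookup(self, response):  # spider callback
    if 'redirect_urls' in response.meta:
        raise LoginNeeded(response.request.url)
    # ... normal scraping ...

# In middlewares.py (enable it under SPIDER_MIDDLEWARES and import
# LoginNeeded from wherever you define it):
class ReloginSpiderMiddleware(object):

    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, LoginNeeded):
            # Log back in, then retry the request that was redirected.
            return [spider.login_request(retry_url=str(exception))]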
But I think a better and simpler approach would be to do the same once you detect the need to log in, i.e. in parse_lookup. I'm not sure exactly how CONCURRENT_REQUESTS_PER_DOMAIN works, but setting it to 1 might let you process one request at a time, so there should be no failing requests, as you always log back in when you need to.
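For instance, in settings.py (CONCURRENT_REQUESTS is capped too only as a precaution; it may not be strictly necessary if everything comes from one domain):
# settings.py
# Process one request at a time so a re-login can finish before the next
# page request goes out.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS = 1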
Upvotes: 2