Reputation: 23551
The original code:
class HomepageSpider(BaseSpider):
    name = 'homepage_spider'

    def start_requests(self):
        ...

    def parse(self, response):
        # harvest some data from response
        item = ...
        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            cb_kwargs={"item": item}
        )

    def parse_details(self, response, item):
        # harvest details
        ...
        yield item
This is the standard way to follow links on a page. However, it has a flaw: if there is an HTTP error (e.g. 503) or a connection error when following the 2nd URL, parse_details is never called and yield item is never executed, so all the data is lost.
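For context, scrapy.Request also accepts an errback argument alongside callback; Scrapy calls it with a Twisted Failure when the request errors out, which is the hook for exactly this situation. A minimal sketch of attaching one (on_detail_error is a hypothetical method on the same spider):

    # inside parse(), after the item has been partially harvested
    yield scrapy.Request(
        "https://detail-page",
        callback=self.parse_details,
        errback=self.on_detail_error,  # hypothetical errback; receives a twisted Failure
        cb_kwargs={"item": item},
    )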
Changed code:
class HomepageSpider(BaseSpider):
    name = 'homepage_spider'

    def start_requests(self):
        ...

    def parse(self, response):
        # harvest some data from response
        item = ...
        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            cb_kwargs={"item": item}
        )
        yield item

    def parse_details(self, response, item):
        # harvest details
        ...
The changed code does not work: it seems yield item is executed immediately, before parse_details runs (perhaps due to the Twisted framework; this behavior differs from what one would expect with the asyncio library), and so the item is always yielded with incomplete data.
How can I make sure yield item is executed only after all links are followed, regardless of success or failure? Is something like

    res1 = scrapy.Request(...)
    res2 = scrapy.Request(...)
    yield scrapy.join([res1, res2])  # block until both URLs are followed?
    yield item

possible?
Upvotes: 1
Views: 394
Reputation: 1487
You can send failed requests to an errback function (whenever an error happens) and yield the item from there.
from scrapy.spidermiddlewares.httperror import HttpError

class HomepageSpider(BaseSpider):
    name = 'homepage_spider'

    def start_requests(self):
        ...

    def parse(self, response):
        # harvest some data from response
        item = ...
        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            meta={"item": item},
            errback=self.my_handle_error
        )

    def parse_details(self, response):
        item = response.meta['item']
        # harvest details
        ...
        yield item

    def my_handle_error(self, failure):
        # the errback only receives the Failure; the item travels on the request's meta
        item = failure.request.meta['item']
        print(f"Error on {failure.request.url}")
        # you can do much deeper error checking here to see what type of failure
        # occurred: DNSLookupError, TimeoutError, HttpError, ...
        yield item
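As the comment hints, the errback can branch on the type of failure with failure.check(); a rough sketch of that kind of check, using exception classes from Scrapy and Twisted (the method belongs on the spider class):

    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

    def my_handle_error(self, failure):
        item = failure.request.meta['item']
        if failure.check(HttpError):
            # non-2xx response; the response object is available on the failure
            self.logger.error("HttpError %s on %s",
                              failure.value.response.status, failure.value.response.url)
        elif failure.check(DNSLookupError):
            self.logger.error("DNSLookupError on %s", failure.request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            self.logger.error("TimeoutError on %s", failure.request.url)
        yield item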
Second edit, to yield the item when using cb_kwargs:
    yield scrapy.Request(
        "https://detail-page",
        callback=self.parse_details,
        cb_kwargs={"item": item},
        errback=lambda failure, item=item: self.my_handle_error(failure, item)
    )

    def my_handle_error(self, failure, item):
        yield item
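If the goal from the question is to yield the item only after every detail link has been attempted (success or failure), one possible approach is to chain the requests and carry the remaining URLs along in cb_kwargs. This is only a sketch under that assumption, with hypothetical URLs and helper names:

    import scrapy

    class HomepageSpider(scrapy.Spider):
        name = 'homepage_spider'

        def parse(self, response):
            item = {}  # harvest some data from response
            detail_urls = ["https://detail-page-1", "https://detail-page-2"]  # hypothetical
            yield self._next_request_or_item(item, detail_urls)

        def _next_request_or_item(self, item, pending):
            # yield the item once every detail URL has been attempted,
            # otherwise schedule the next detail request
            if not pending:
                return item
            return scrapy.Request(
                pending[0],
                callback=self.parse_details,
                errback=self.handle_detail_error,
                cb_kwargs={"item": item, "pending": pending[1:]},
            )

        def parse_details(self, response, item, pending):
            # harvest details into item
            ...
            yield self._next_request_or_item(item, pending)

        def handle_detail_error(self, failure):
            # cb_kwargs are still available on the failed request
            kwargs = failure.request.cb_kwargs
            yield self._next_request_or_item(kwargs["item"], kwargs["pending"])

Because the errback feeds back into the same chain, the item is yielded exactly once, after the last detail URL has been tried, whether or not any of the requests failed.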
Upvotes: 1