Reputation: 5231
I need to be able to retry the request if certain xpaths were not found on the page. So I wrote this middleware:
class ManualRetryMiddleware(RetryMiddleware):
def process_response(self, request, response, spider):
if not spider.retry_if_not_found:
return response
if not hasattr(response, 'text') and response.status != 200:
return super(ManualRetryMiddleware, self).process_response(request, response, spider)
found = False
for xpath in spider.retry_if_not_found:
if response.xpath(xpath).extract():
found = True
break
if not found:
return self._retry(request, "Didn't find anything useful", spider)
return response
And registered it in settings.py
:
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.ManualRetryMiddleware': 650,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
When I run the spider, I get
AttributeError: 'Response' object has no attribute 'xpath'
I tried to manually create selector and run xpath on it... But the response has no text
property and response.body
is bytes, not str...
So how can I check page content in middleware? It's possible that some pages won't contain details that I need, so I'd like to be able to try them again later.
Upvotes: 2
Views: 1378
Reputation: 347
Also take care of your middleware position. It needs to be before the scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware
otherwise, you may end up trying to decode compressed data (which is indeed not working). Check the response.header to know if the response is compressed - Content-Encoding: gzip
.
Upvotes: 1
Reputation: 10210
The reason response
doesn't contain xpath
method is that response
parameter in process_response
method of downloader middleware is of type scrapy.http.Response
, see the documentation. Only scrapy.http.TextResponse
(and scrapy.http.HtmlResponse
) do have xpath
method. So before using xpath
, create HtmlResponse
object from response
. The corresponding part of your class would become:
...
new_response = scrapy.http.HtmlResponse(response.url, body=response.body)
if new_response.xpath(xpath).extract():
found = True
break
...
Upvotes: 1