Dmitrii Mikhailov

Reputation: 5231

How to get response body in scrapy downloader middleware

I need to be able to retry the request if certain xpaths were not found on the page. So I wrote this middleware:

from scrapy.downloadermiddlewares.retry import RetryMiddleware


class ManualRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if not spider.retry_if_not_found:
            return response
        if not hasattr(response, 'text') and response.status != 200:
            return super(ManualRetryMiddleware, self).process_response(request, response, spider)
        found = False
        for xpath in spider.retry_if_not_found:
            if response.xpath(xpath).extract():
                found = True
                break
        if not found:
            return self._retry(request, "Didn't find anything useful", spider)
        return response

And registered it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ManualRetryMiddleware': 650,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

When I run the spider, I get

AttributeError: 'Response' object has no attribute 'xpath'

I tried to create a selector manually and run the XPath on it, but the response has no text property, and response.body is bytes, not str.

So how can I check page content in middleware? It's possible that some pages won't contain details that I need, so I'd like to be able to try them again later.

Upvotes: 2

Views: 1378

Answers (2)

mouch

Reputation: 347

Also mind your middleware's position. It needs to come before scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware in the middleware order, that is, have a lower order number than that middleware (590 by default), because process_response is called in decreasing order. Otherwise you may end up trying to decode compressed data, which will not work. Check response.headers to see whether the response is compressed, e.g. Content-Encoding: gzip.
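
As a rough sketch of that ordering, the registration from the question could be moved to a lower number. The value 550 is only an illustrative choice (it is the slot the built-in RetryMiddleware normally occupies), and the 590 mentioned in the comment assumes Scrapy's default DOWNLOADER_MIDDLEWARES_BASE; check the defaults of your Scrapy version before relying on it.

```python
# settings.py (sketch, not a definitive configuration)
DOWNLOADER_MIDDLEWARES = {
    # Assumption: HttpCompressionMiddleware keeps its default order of 590.
    # process_response hooks run from high numbers to low numbers, so any
    # value below 590 means this middleware sees the decompressed body.
    'myproject.middlewares.ManualRetryMiddleware': 550,
    # Disable the built-in retry middleware, as in the question.
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
```

With 650, as in the question, the custom process_response would run before decompression has happened.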

Upvotes: 1

Tomáš Linhart

Reputation: 10210

The reason response has no xpath method is that the response parameter of process_response in a downloader middleware is of type scrapy.http.Response (see the documentation). Only scrapy.http.TextResponse and its subclasses, such as scrapy.http.HtmlResponse, have an xpath method. So before using xpath, create an HtmlResponse object from the response. The corresponding part of your class would become:

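...
# Wrap the raw bytes in an HtmlResponse so that .xpath() is available.
# Building it once, before the loop over the xpaths, avoids re-parsing the body.
new_response = scrapy.http.HtmlResponse(response.url, body=response.body)
if new_response.xpath(xpath).extract():
    found = True
    break
...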

Upvotes: 1
