Ilir

Reputation: 450

How to handle http error codes using CrawlSpider in scrapy

I am trying to use Scrapy to test some websites and their subsites for HTTP return codes, i.e. to detect errors in the 400 and 500 ranges. Additionally, I would also like to see and handle codes in the 300 range. I have been trying for days and checking the docs, but I am stuck and cannot find a solution. Thanks for helping out!

Below you will see the spider I am creating with CrawlSpider. The goal is to see/catch HTTP responses within the error ranges in my parse_item() function. I have added handle_httpstatus_all = True to settings.py, but nothing besides status 200 responses is coming in at parse_item.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class IcrawlerSpider(CrawlSpider):

    name = 'icrawler'

    def __init__(self, *args, **kwargs):
        # We are going to pass these args from our Django view.
        # To make everything dynamic, we need to override them inside the __init__ method.
        handle_httpstatus_all = True
        self.url = kwargs.get('url')
        self.domain = kwargs.get('domain')
        self.start_urls = [self.url]
        self.allowed_domains = [self.domain]

        IcrawlerSpider.rules = [
            Rule(LinkExtractor(unique=True), callback='parse_item'),
        ]
        super(IcrawlerSpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        # You can tweak each crawled page here.
        # Don't forget to return an object.
        if response.status == 403:
            self.logger.error("ERROR_CODE_RETURNED: %s", response.status)
        i = {}
        i['url'] = response.url
        i['status_code'] = response.status
        return i

Most probably I am missing something elementary about why no error codes are being passed through.

Upvotes: 0

Views: 2825

Answers (2)

malberts

Reputation: 2536

If you need to do this with Rules, then you can modify the generated Requests by providing a process_request callback. Here's a summary:

class IcrawlerSpider(CrawlSpider):
    def __init__(self, *args, **kwargs):
        # ...
        IcrawlerSpider.rules = [
           Rule(LinkExtractor(unique=True), callback='parse_item', process_request='add_meta'),
        ]

    def add_meta(self, request):
        request.meta['handle_httpstatus_all'] = True
        return request

Refer to the documentation and an example.
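
For completeness, here is a fuller sketch of this Rule-based approach. The name, domain and start URL are placeholders standing in for the dynamic values set in your __init__, and the optional response parameter on add_meta is only there because newer Scrapy versions also pass the originating response to process_request:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class IcrawlerSpider(CrawlSpider):
    # Placeholders; the original spider fills these from kwargs in __init__.
    name = 'icrawler'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    rules = [
        Rule(LinkExtractor(unique=True),
             callback='parse_item',
             follow=True,
             process_request='add_meta'),
    ]

    def add_meta(self, request, response=None):
        # Deliver every status code (3xx/4xx/5xx included) to the callback
        # instead of letting the middlewares redirect or filter it out.
        request.meta['handle_httpstatus_all'] = True
        return request

    def parse_item(self, response):
        if response.status >= 400:
            self.logger.error("ERROR_CODE_RETURNED: %s for %s",
                              response.status, response.url)
        yield {'url': response.url, 'status_code': response.status}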

Upvotes: 0

vezunchik

Reputation: 3717

The handle_httpstatus_all flag should be set in the meta of each of your requests; check the docs here.

As for settings, you can play with HTTPERROR_ALLOW_ALL or set a list of HTTPERROR_ALLOWED_CODES.

Like this:

class IcrawlerSpider(CrawlSpider):
    name = 'icrawler'
    custom_settings = {'HTTPERROR_ALLOW_ALL': True}
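
If you only want specific codes rather than everything, HTTPERROR_ALLOWED_CODES takes a list of non-200 status codes to let through; a small sketch (the code list here is just an example):

class IcrawlerSpider(CrawlSpider):
    name = 'icrawler'
    # Only these non-200 codes reach the callbacks; 3xx responses are
    # normally consumed by the redirect middleware before this filter applies.
    custom_settings = {'HTTPERROR_ALLOWED_CODES': [403, 404, 500, 502, 503]}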

Or refactor your spider to yield requests like yield Request(link, self.parse_item, meta={'handle_httpstatus_all': True}). I don't know how to apply meta params to Rules, though.
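
For reference, a minimal sketch of that manual-request style; the placeholder URL/domain and the link-following inside parse_item are my assumptions, not your original Rules:

import scrapy
from scrapy.http import TextResponse
from scrapy.linkextractors import LinkExtractor


class IcrawlerSpider(scrapy.Spider):
    name = 'icrawler'
    # Placeholders; in the original spider these come from the url/domain kwargs.
    allowed_domains = ['example.com']
    start_url = 'https://example.com/'

    def start_requests(self):
        yield scrapy.Request(self.start_url,
                             callback=self.parse_item,
                             meta={'handle_httpstatus_all': True})

    def parse_item(self, response):
        yield {'url': response.url, 'status_code': response.status}
        # Follow links manually, carrying the same meta flag on every request.
        if isinstance(response, TextResponse):
            for link in LinkExtractor(unique=True).extract_links(response):
                yield scrapy.Request(link.url,
                                     callback=self.parse_item,
                                     meta={'handle_httpstatus_all': True})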

Upvotes: 5
