Crypto

Reputation: 1227

Handling Error Pages in Scrapy

I have one URL in start_urls

When the crawler first loads the page, it is served a 403 error page, after which the crawler shuts down.

What I need to do is fill out a captcha on that page and it will then let me access the page. I know how to write the code for bypassing the captcha but where do I put this code in my spider class?

I need to add this on other pages as well when it encounters the same problem.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from scrapy.selector import Selector

class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://mydomain.com/categories"]
    handle_httpstatus_list = [403] #Where do I now add the captcha bypass code?
    download_delay = 5
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        pass

Upvotes: 4

Views: 2473

Answers (1)

Blender

Reputation: 298532

Set handle_httpstatus_list so that 403 is treated as a successful response code:

class MySpider(CrawlSpider):
    handle_httpstatus_list = [403]

As for bypassing the actual captcha, you need to override parse to handle all pages with a 403 response code differently:

from scrapy.http import Request

def parse(self, response):
    # Intercept captcha challenges before handing the page to CrawlSpider
    if response.status == 403:
        yield self.handle_captcha(response)
        return

    # Otherwise defer to CrawlSpider's own parse and re-yield its output
    for request_or_item in CrawlSpider.parse(self, response):
        yield request_or_item

def handle_captcha(self, response):
    # Fill in the captcha and send a new request
    return Request(...)
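
One possible shape for handle_captcha, as a sketch only: the solve_captcha helper and the 'captcha' form field name below are placeholders for the asker's own captcha-solving code and whatever the target site's form actually expects; FormRequest.from_response is just one way to submit the solved value back to the same page.

from scrapy.http import FormRequest

def handle_captcha(self, response):
    # Hypothetical: solve_captcha() stands in for your existing
    # captcha-solving code; 'captcha' is a placeholder field name.
    solution = self.solve_captcha(response)
    return FormRequest.from_response(
        response,
        formdata={'captcha': solution},
        callback=self.parse,  # re-enter parse once the captcha is accepted
        dont_filter=True,     # URL was already seen, so skip the dupe filter
    )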

Upvotes: 7
