Reputation: 1227
I have one URL in start_urls
The first time the crawler loads the page, it is shown a 403 error page, after which the crawler shuts down.
What I need to do is fill out a captcha on that page, after which I can access the page. I know how to write the code for solving the captcha, but where do I put that code in my spider class?
I will need to do the same on other pages whenever the crawler runs into this problem.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://mydomain.com/categories"]
    handle_httpstatus_list = [403]  # Where do I now add the captcha bypass code?
    download_delay = 5

    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        pass
Upvotes: 4
Views: 2473
Reputation: 298532
Set handle_httpstatus_list to treat 403 as a successful response code (by default, Scrapy only passes responses with 2xx status codes to your callbacks):
class MySpider(CrawlSpider):
    handle_httpstatus_list = [403]
As for bypassing the actual captcha, you need to override parse to handle all pages with a 403 response code differently:
from scrapy.http import Request

def parse(self, response):
    if response.status == 403:
        yield self.handle_captcha(response)
    else:
        # Defer everything else to CrawlSpider's normal parsing
        for request_or_item in CrawlSpider.parse(self, response):
            yield request_or_item

def handle_captcha(self, response):
    # Fill in the captcha and send a new request
    return Request(...)
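For the captcha step itself, here is a minimal sketch of what handle_captcha could look like. It assumes the 403 page contains an HTML form with a field named captcha, and that solve_captcha is your own (hypothetical) solving routine; both names are placeholders, not Scrapy APIs:

from scrapy.http import FormRequest

def handle_captcha(self, response):
    # solve_captcha() stands in for your own captcha-solving logic
    answer = self.solve_captcha(response)
    # Submit the captcha form found on the 403 page; dont_filter lets
    # the same URL be requested again once the captcha is accepted
    return FormRequest.from_response(
        response,
        formdata={'captcha': answer},  # assumed field name
        callback=self.parse,
        dont_filter=True,
    )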
Upvotes: 7