Reputation: 13208
I'm trying to get Scrapy to log in to a website, then go to particular pages of it and scrape information. I have the code below:
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest
from scrapy import log

class DemoSpider(InitSpider):
    name = "demo"
    allowed_domains = ['example.com']
    login_page = "https://www.example.com/"
    start_urls = ["https://www.example.com/secure/example"]

    rules = (Rule(SgmlLinkExtractor(allow=r'\w+'), callback='parse_item', follow=True),)

    # Initialization
    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    # Perform login with the username and password
    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'name': 'user', 'password': 'password'},
            callback=self.check_login_response)

    # Check the response after logging in, make sure it went well
    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            self.log('will initialize')
            self.initialized(response)

    def parse_item(self, response):
        self.log('got to the parse item page')
Every time I run the spider, it logs in and reaches the initialization step. However, it NEVER matches a rule. Is there a reason for this? I checked the question below about this:
Crawling with an authenticated session in Scrapy
There are also a number of other sites, including the documentation. Why is it that, after initializing, it never goes through the start_urls
and scrapes each page?
Upvotes: 0
Views: 590
Reputation: 7889
From looking at other questions, it would appear that you need to return self.initialized with no parameters, i.e. return self.initialized().
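For reference, a minimal sketch of what that change looks like in the callback from the question (everything else left as-is):

def check_login_response(self, response):
    """Check the login response, then hand control back to InitSpider."""
    if "authentication failed" in response.body:
        self.log("Login failed", level=log.ERROR)
        return
    self.log('will initialize')
    # Returning initialized() with no arguments lets InitSpider resume
    # crawling and request the start_urls.
    return self.initialized()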
Upvotes: 1
Reputation: 31568
You can't use rules in InitSpider; they are only available in CrawlSpider.
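If you need both the login step and the link-following rules, one possible approach is to switch to CrawlSpider and do the login in start_requests. A rough sketch based on the question's code (the class name DemoCrawlSpider and the after_login callback are illustrative; the form field names are the placeholders from the question):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest

class DemoCrawlSpider(CrawlSpider):
    name = "demo_crawl"
    allowed_domains = ['example.com']
    login_page = "https://www.example.com/"
    start_urls = ["https://www.example.com/secure/example"]
    rules = (Rule(SgmlLinkExtractor(allow=r'\w+'), callback='parse_item', follow=True),)

    def start_requests(self):
        # Log in before anything else; start_urls are requested afterwards
        return [Request(url=self.login_page, callback=self.login)]

    def login(self, response):
        return FormRequest.from_response(response,
            formdata={'name': 'user', 'password': 'password'},
            callback=self.after_login)

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed")
            return
        # Re-issue the start_urls through CrawlSpider's default parse()
        # so the rules are applied to those responses
        for url in self.start_urls:
            yield Request(url)

    def parse_item(self, response):
        self.log('got to the parse item page')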
Upvotes: 3