Reputation: 1217
This is how my spider is set up:
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class CustomSpider(CrawlSpider):
    name = 'custombot'
    allowed_domains = ['www.domain.com']
    start_urls = ['http://www.domain.com/some-url']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'.*?something/'), callback='do_stuff', follow=True),
    )

    def start_requests(self):
        return [Request('http://www.domain.com/some-other-url', callback=self.do_something_else)]
It goes to /some-other-url but not to /some-url. What is wrong here? The URLs specified in start_urls are the ones that need links extracted and sent through the rules filter, whereas the one in start_requests is sent directly to the item parser, so it doesn't need to pass through the rules filter.
Upvotes: 12
Views: 22114
Reputation: 7889
From the documentation for start_requests, overriding start_requests means that the URLs defined in start_urls are ignored:
This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, the make_requests_from_url() is used instead to create the Requests.
[...]
If you want to change the Requests used to start scraping a domain, this is the method to override.
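To see why, it helps to look at what the default implementation does. In Scrapy versions of this era it is roughly equivalent to the sketch below, so overriding it in your spider removes the only code path that ever reads start_urls:

def start_requests(self):
    # Rough sketch of Scrapy's default start_requests (old-style API).
    # Your override replaces this loop, so start_urls is never consulted.
    for url in self.start_urls:
        yield self.make_requests_from_url(url)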
If you want to just scrape from /some-url, then remove start_requests. If you want to scrape from both, then keep start_requests but also return a Request for /some-url from it; a Request with no callback is handled by CrawlSpider's default parse(), which runs the response through the rules.
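Put together, a minimal sketch of that second option (using the same hypothetical URLs and callbacks as the question) could look like this:

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class CustomSpider(CrawlSpider):
    name = 'custombot'
    allowed_domains = ['www.domain.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'.*?something/'), callback='do_stuff', follow=True),
    )

    def start_requests(self):
        # No callback: the response is handled by CrawlSpider's default
        # parse(), which applies the rules above.
        yield Request('http://www.domain.com/some-url')
        # Explicit callback: this response bypasses the rules entirely.
        yield Request('http://www.domain.com/some-other-url', callback=self.do_something_else)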
Upvotes: 15