Reputation: 11
I am currently using scrapy's CrawlSpider to look for specific info across a list of multiple start_urls. What I would like to do is stop scraping a specific start_url's domain once I've found the information I'm looking for, so the spider stops hitting that domain and only continues with the other start_urls.
Is there a way to do this? I have tried appending to deny_domains like so:
deniedDomains = []
...
rules = [Rule(SgmlLinkExtractor(..., deny_domains=(etc), ...))]
...
def parseURL(self, response):
    ...
    self.deniedDomains.append(specificDomain)
Appending doesn't seem to stop the crawling, but if I start the spider with the intended specificDomain then it'll stop as requested. So I'm assuming that you can't change the deny_domains list after the spider's started?
Upvotes: 1
Views: 1521
Reputation: 505
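The best way to do this is to maintain your own dynamic_deny_domain list in your Spider class and check it in a small downloader middleware. The middleware needs only one method:
process_request(request, spider):
It should raise IgnoreRequest if the request's domain is in the spider.dynamic_deny_domain list, and return None otherwise. Then add your downloader middleware to the DOWNLOADER_MIDDLEWARES setting in your Scrapy settings, with a low order value so it runs first:
'myproject.downloadermiddleware.IgnoreDomainMiddleware': 50,
That should do the trick.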
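A minimal sketch of what such a middleware could look like, written for Python 3. The class name IgnoreDomainMiddleware and the attribute dynamic_deny_domain come from the answer above; the subdomain matching rule and the fallback IgnoreRequest class (there only so the snippet runs without Scrapy installed) are assumptions for illustration.

```python
from urllib.parse import urlparse

try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:
    # Stand-in so this sketch can run without Scrapy installed (assumption).
    class IgnoreRequest(Exception):
        pass


class IgnoreDomainMiddleware:
    """Drop requests whose host is in the spider's dynamic_deny_domain list."""

    def process_request(self, request, spider):
        # The spider is expected to own a mutable list of denied domains.
        denied = getattr(spider, "dynamic_deny_domain", ())
        host = urlparse(request.url).hostname or ""
        for domain in denied:
            # Match the domain itself and any of its subdomains.
            if host == domain or host.endswith("." + domain):
                raise IgnoreRequest("domain denied: " + domain)
        # Returning None lets Scrapy continue processing the request.
        return None
```

With this in place, a callback that runs self.dynamic_deny_domain.append(specificDomain) should affect every request the middleware sees afterwards, because the list is read on each call rather than once at startup. Responses that were already downloaded are not affected.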
Upvotes: 1
Reputation: 6972
Something like this?
# Note: newer Scrapy releases moved these to scrapy.spiders and scrapy.linkextractors.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = "foo"
    allowed_domains = ["example.org"]
    start_urls = ["http://www.example.org/foo/"]

    rules = (
        Rule(SgmlLinkExtractor(
                allow=('/foo/[^/+]',),
                # deny_domains is fixed here, when the rule is built
                deny_domains=('example.com',)),
             callback='parseURL'),
    )

    def parseURL(self, response):
        # here the rest of your code
        pass
Upvotes: 0