Reputation: 363
I'm using Scrapy to browse and collect data, but I'm finding that the spider is crawling lots of unwanted pages. What I'd prefer is for the spider to start from a defined set of pages, parse the content on those pages, and then finish. I've tried to implement a rule like the one below, but it's still crawling a whole series of other pages as well. Any suggestions on how to approach this?
rules = (
    Rule(SgmlLinkExtractor(), callback='parse_adlinks', follow=False),
)
Thanks!
Upvotes: 1
Views: 249
Reputation: 21436
Your link extractor is extracting every link because you haven't set any restricting arguments on it.
If you take a look at the official documentation, you'll notice that Scrapy link extractors have lots of parameters you can set to customize what they extract.
Example:
rules = (
    # only links from specific domains
    Rule(LxmlLinkExtractor(allow_domains=['scrapy.org', 'blog.scrapy.org']), <..>),
    # only links whose URLs match a specific regex
    Rule(LxmlLinkExtractor(allow=r'.+?/page\d+\.html'), <..>),
    # don't crawl specific file extensions (given without the leading dot)
    Rule(LxmlLinkExtractor(deny_extensions=['pdf', 'html']), <..>),
)
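To make that concrete, here's a minimal runnable sketch of one such restricted rule inside a CrawlSpider; the spider name, start URL, and extraction logic are illustrative placeholders rather than anything from your code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class AdSpider(CrawlSpider):
    name = 'ads'  # placeholder name
    start_urls = ['http://www.example.com/ads']  # placeholder start page

    rules = (
        # only request links that look like paginated listing pages,
        # parse them with parse_adlinks and don't follow them any further
        Rule(LxmlLinkExtractor(allow=r'.+?/page\d+\.html'),
             callback='parse_adlinks', follow=False),
    )

    def parse_adlinks(self, response):
        # placeholder extraction logic
        yield {'url': response.url, 'title': response.css('title::text').get()}

With follow=False, the matched pages are still downloaded and passed to the callback, but links found on those pages aren't followed any further.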
You can also set allowed domains for your spider if you don't want it to wander off somewhere:
class MySpider(scrapy.Spider):
    allowed_domains = ['scrapy.org']
    # will only crawl pages from this domain ^
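For completeness, a rough sketch of a spider using allowed_domains (again, the name, start URL, and parsing callbacks are just placeholders):

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder name
    allowed_domains = ['scrapy.org']
    start_urls = ['https://scrapy.org/']  # placeholder start page

    def parse(self, response):
        # follow links found on the page; requests pointing outside
        # scrapy.org are dropped by the offsite filter
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_page)

    def parse_page(self, response):
        # placeholder extraction logic
        yield {'url': response.url, 'title': response.css('title::text').get()}

allowed_domains works alongside the rule-level restrictions, so you can combine both approaches.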
Upvotes: 1