Reputation: 363
I'm using Scrapy to browse and collect data, but I'm finding that the spider is crawling lots of unwanted pages. What I'd prefer is for the spider to start from a defined set of pages, parse the content on those pages, and then finish. I've tried to implement a rule like the one below, but it's still crawling a whole series of other pages as well. Any suggestions on how to approach this?
rules = (
    Rule(SgmlLinkExtractor(), callback='parse_adlinks', follow=False),
)
Thanks!
Upvotes: 1
Views: 249
Reputation: 21436
Your link extractor is extracting every link because you haven't set any restricting arguments on it.
If you take a look at the official documentation, you'll notice that Scrapy link extractors have lots of parameters you can set to customize what they extract.
Example:
rules = (
    # only links from specific domains
    Rule(LxmlLinkExtractor(allow_domains=['scrapy.org', 'blog.scrapy.org']), <..>),
    # only links whose URLs match a specific regex
    Rule(LxmlLinkExtractor(allow=r'.+?/page\d+\.html'), <..>),
    # don't crawl specific file extensions (given without the leading dot)
    Rule(LxmlLinkExtractor(deny_extensions=['pdf', 'html']), <..>),
)
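To make that concrete, here's a minimal runnable sketch of one such restricted rule inside a CrawlSpider; the spider name, start URL, and extraction logic are illustrative placeholders rather than anything from your code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class AdSpider(CrawlSpider):
    name = 'ads'  # placeholder name
    start_urls = ['http://www.example.com/ads']  # placeholder start page

    rules = (
        # only request links that look like paginated listing pages,
        # parse them with parse_adlinks and don't follow them any further
        Rule(LxmlLinkExtractor(allow=r'.+?/page\d+\.html'),
             callback='parse_adlinks', follow=False),
    )

    def parse_adlinks(self, response):
        # placeholder extraction logic
        yield {'url': response.url, 'title': response.css('title::text').get()}

With follow=False, the matched pages are still downloaded and passed to the callback, but links found on those pages aren't followed any further.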
You can also set allowed domains for your spider if you don't want it to wander off somewhere:
class MySpider(scrapy.Spider):
    allowed_domains = ['scrapy.org']
    # will only crawl pages from this domain ^
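For completeness, a rough sketch of a spider using allowed_domains (again, the name, start URL, and parsing callbacks are just placeholders):

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder name
    allowed_domains = ['scrapy.org']
    start_urls = ['https://scrapy.org/']  # placeholder start page

    def parse(self, response):
        # follow links found on the page; requests pointing outside
        # scrapy.org are dropped by the offsite filter
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_page)

    def parse_page(self, response):
        # placeholder extraction logic
        yield {'url': response.url, 'title': response.css('title::text').get()}

allowed_domains works alongside the rule-level restrictions, so you can combine both approaches.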
Upvotes: 1