Mike77

Reputation: 363

Scrapy - target specified URLs only

I'm using Scrapy to browse and collect data, but I'm finding that the spider is crawling lots of unwanted pages. What I'd prefer the spider to do is just begin from a set of defined pages, parse the content on those pages, and then finish. I've tried to implement a rule like the one below, but it's still crawling a whole series of other pages as well. Any suggestions on how to approach this?

rules = (
    Rule(SgmlLinkExtractor(), callback='parse_adlinks', follow=False),  
)

Thanks!

Upvotes: 1

Views: 249

Answers (1)

Granitosaurus

Reputation: 21436

Your extractor is extracting every link because it doesn't have any rule arguments set.

If you take a look at the official documentation, you'll notice that Scrapy's LinkExtractors have lots of parameters you can set to customize what they extract.

Example:

rules = (
    # only links from specific domains
    Rule(LxmlLinkExtractor(allow_domains=['scrapy.org', 'blog.scrapy.org']), <..>),
    # only links whose URL matches a specific regex
    Rule(LxmlLinkExtractor(allow=r'.+?/page\d+\.html'), <..>),
    # don't crawl specific file extensions (given without the leading dot)
    Rule(LxmlLinkExtractor(deny_extensions=['pdf', 'html']), <..>),
)
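
For context, here's a minimal CrawlSpider sketch that puts one of those rules to work. The spider name, start URL, and the fields yielded in the callback are placeholder assumptions rather than anything from your question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class AdSpider(CrawlSpider):
    name = 'ads'
    start_urls = ['http://example.com/ads/']  # hypothetical listing page

    rules = (
        # follow only links whose URL matches the page pattern,
        # hand them to parse_adlinks, and don't follow any further
        Rule(LxmlLinkExtractor(allow=r'.+?/page\d+\.html'),
             callback='parse_adlinks', follow=False),
    )

    def parse_adlinks(self, response):
        # pull whatever you need from each matched page
        yield {'url': response.url, 'title': response.css('title::text').get()}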

You can also set allowed domains for your spider if you don't want it to wander off somewhere:

class MySpider(scrapy.Spider):
    allowed_domains = ['scrapy.org']
    # will only crawl pages from this domain ^
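
And if the goal is really just to fetch a fixed list of pages, parse them, and stop, you don't need link extraction at all: a plain Spider with start_urls and no further requests will do it. A sketch, with placeholder URLs and item fields:

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['scrapy.org']
    start_urls = [
        'https://scrapy.org/',
        'https://docs.scrapy.org/en/latest/',
    ]

    def parse(self, response):
        # no new requests are yielded here, so the crawl ends
        # once the start_urls have been processed
        yield {'url': response.url, 'title': response.css('title::text').get()}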

Upvotes: 1
