user988544

Reputation: 576

Scrapy does not crawl through data contained in start URLs

I am trying to crawl an entire website using Scrapy. According to Scrapy's documentation:

start_urls - A list of URLs where the spider will begin to crawl from, when no particular URLs are specified. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.

According to this definition, Scrapy should spider through all the sub-URLs of the pages listed in start_urls, but it only crawls the URLs I have specified. I did add the rule mentioned in Scrapy - Crawl whole website, but it didn't help. It still only scrapes and outputs the pages listed in start_urls.

Here is a snippet of my code:

class AcdivocaFirstSpider(scrapy.Spider):
    name = "example_sample"
    allowed_domains = ["example.org"]
    start_urls = ["http://www.example.org/site/id/home"]
    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]

    def parse(self, response):
        filename = response.url.split("/")[-1] #so eg it would name 'home'
        open(filename, 'wb').write(response.body)

This yields a single file with extracted HTML data for 'home' page. How do I get it to recursively crawl the entire website starting from the homepage?

Any help is appreciated. Thank you.

Upvotes: 2

Views: 2200

Answers (1)

paul trmbrth

Reputation: 20748

2 things to change:

  • to use rules, make AcdivocaFirstSpider a subclass of scrapy.contrib.spiders.CrawlSpider, not scrapy.Spider

The subsequent URLs will be generated successively from data contained in the start URLs.

This phrase is misleading. scrapy.Spider by itself does nothing special with those start URLs: it downloads them and passes each response to parse(). If your parse() callback yields further requests, then yes, subsequent URLs will be generated from data contained in those pages, but that is not automatic.

  • when using scrapy.contrib.spiders.CrawlSpider, you must NOT override the built-in parse() method; that is where the rules are checked and the requests for further pages are generated. So rename parse to parse_item (the callback name referenced in your rule)

See the warning in the docs on crawling rules.

Upvotes: 3
