pablo07

Reputation: 71

How to follow links with Scrapy

How can I follow links in this example: http://snippets.scrapy.org/snippets/7/ ? The script stops after visiting the first page.

class MySpider(BaseSpider):
    """Our ad-hoc spider"""
    name = "myspider"
    start_urls = ["http://stackoverflow.com/"]

    question_list_xpath = '//div[@id="content"]//div[contains(@class, "question-summary")]'

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        for qxs in hxs.select(self.question_list_xpath):
            loader = XPathItemLoader(QuestionItem(), selector=qxs)
            loader.add_xpath('title', './/h3/a/text()')
            loader.add_xpath('summary', './/h3/a/@title')
            loader.add_xpath('tags', './/a[@rel="tag"]/text()')
            loader.add_xpath('user', './/div[@class="started"]/a[2]/text()')
            loader.add_xpath('posted', './/div[@class="started"]/a[1]/span/@title')
            loader.add_xpath('votes', './/div[@class="votes"]/div[1]/text()')
            loader.add_xpath('answers', './/div[contains(@class, "answered")]/div[1]/text()')
            loader.add_xpath('views', './/div[@class="views"]/div[1]/text()')

            yield loader.load_item()

I've tried changing:

class MySpider(BaseSpider):

To

class MySpider(CrawlSpider):

And add

rules = (
    Rule(SgmlLinkExtractor(allow=()),
         callback='parse',follow=True),
)

But it doesn't crawl the whole site.

Thanks,

Upvotes: 1

Views: 1902

Answers (1)

reclosedev

Reputation: 9502

Yes, you need to subclass CrawlSpider and rename the parse function to something like parse_page, because CrawlSpider uses parse itself to start scraping, so a callback with that name overrides it and the rules stop being followed. This was already answered.

Upvotes: 1
