Reputation: 71
How can I follow links in this example: http://snippets.scrapy.org/snippets/7/ ? The script stops after visiting the first page.
    class MySpider(BaseSpider):
        """Our ad-hoc spider"""
        name = "myspider"
        start_urls = ["http://stackoverflow.com/"]
        question_list_xpath = '//div[@id="content"]//div[contains(@class, "question-summary")]'

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for qxs in hxs.select(self.question_list_xpath):
                loader = XPathItemLoader(QuestionItem(), selector=qxs)
                loader.add_xpath('title', './/h3/a/text()')
                loader.add_xpath('summary', './/h3/a/@title')
                loader.add_xpath('tags', './/a[@rel="tag"]/text()')
                loader.add_xpath('user', './/div[@class="started"]/a[2]/text()')
                loader.add_xpath('posted', './/div[@class="started"]/a[1]/span/@title')
                loader.add_xpath('votes', './/div[@class="votes"]/div[1]/text()')
                loader.add_xpath('answers', './/div[contains(@class, "answered")]/div[1]/text()')
                loader.add_xpath('views', './/div[@class="views"]/div[1]/text()')
                yield loader.load_item()
I've tried changing:
    class MySpider(BaseSpider):
to:
    class MySpider(CrawlSpider):
and adding:
    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback='parse', follow=True),
    )
But it still doesn't crawl the whole site.
Thanks,
Upvotes: 1
Views: 1902
Reputation: 9502
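Yes, you need to subclass CrawlSpider and rename your parse function to something like parse_page (and set callback='parse_page' in the rule to match), because CrawlSpider uses parse itself to start scraping. If you override parse, the link-following logic never runs.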
This was already answered
Upvotes: 1