Macro

Reputation: 93

Crawling a site recursively using scrapy

I am trying to scrape a site using Scrapy.

This is the code I have written so far, based on http://thuongnh.com/building-a-web-crawler-with-scrapy/ (the original code does not work at all, so I tried to rebuild it):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.spiders import Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from nettuts.items import NettutsItem


class MySpider(Spider):
    name = "nettuts"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/posts?"]
    rules = [Rule(LinkExtractor(allow=('')), callback='parse', follow=True)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = []

        titles = hxs.xpath('//li[@class="posts__post"]/a/text()').extract()
        for title in titles:
            item = NettutsItem()
            item["title"] = title
            yield item
        return

The problem is that the crawler goes to the start page but does not scrape any pages after that.

Upvotes: 8

Views: 11177

Answers (2)

Santosh Pillai

Reputation: 8623

The following can be a good way to start:

There are two common use cases for 'crawling a site recursively using scrapy'.

A) We just want to move across the website, using, say, the pagination buttons of a table, and fetch data. This is relatively straightforward.

import scrapy

class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        ''' do something with this parser '''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Observe the last four lines. Here:

  1. We get the link to the next page from the XPath of the 'Next' pagination button.
  2. The if condition checks that we have not yet reached the end of the pagination.
  3. We join this link (from step 1) with the main URL using urljoin.
  4. We make a recursive call to the 'parse' callback method (a fleshed-out sketch follows below).
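
As an illustration, here is a minimal, self-contained sketch of what the placeholder parser body could look like; the start URL, the row XPath and the field name are assumptions, not part of the original answer:

import scrapy

class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ["https://example.com/trips"]  # placeholder URL

    def parse(self, response):
        # Extract something from the current page (the row XPath is an assumption)
        for row in response.xpath("//table//tr"):
            yield {"first_cell": row.xpath("./td[1]/text()").extract_first()}

        # Follow the 'Next' pagination button, if there is one
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)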

B) Not only do we want to move across pages, we also want to extract data from one or more links on each page.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains')
    )

    def parse_trains(self, response):
        '''do your parsing here'''

Over here, observe that:

  1. We are using the 'CrawlSpider' subclass of the 'scrapy.Spider' parent class.

  2. We have set two 'Rules':

    a) The first rule just checks whether there is a 'next_page' available and follows it.

    b) The second rule requests all the links on a page that match a given format, say '/trains/12343', and then calls 'parse_trains' to do the parsing.

  3. Important: we don't want to use the regular 'parse' method here, because we are using the 'CrawlSpider' subclass. That class uses 'parse' internally, so we must not override it. Just remember to name your callback method something other than 'parse'. A filled-in sketch follows below.
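
For illustration, here is a minimal sketch with 'parse_trains' filled in; the start URL and the field XPaths are assumptions and would need to be adapted to the real pages:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['https://example.com/stations']  # placeholder URL
    rules = (
        # Keep following the pagination
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        # Send every /trains/<id> link to the custom callback (not 'parse')
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains'),
    )

    def parse_trains(self, response):
        # Hypothetical fields; adjust the XPaths to the actual page layout
        yield {
            'url': response.url,
            'name': response.xpath('//h1/text()').extract_first(),
        }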

Upvotes: 15

alecxe

Reputation: 473763

The problem is which Spider class you are using as a base. scrapy.Spider is a simple spider that does not support rules and link extractors.

Instead, use CrawlSpider:

class MySpider(CrawlSpider):
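
For example, here is a sketch of the question's spider rebuilt on top of CrawlSpider; the rule and XPath are taken from the question, and the callback is renamed so it does not override CrawlSpider's built-in parse:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from nettuts.items import NettutsItem

class MySpider(CrawlSpider):
    name = "nettuts"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/posts?"]
    # Follow every link; the callback must not be named 'parse'
    rules = [Rule(LinkExtractor(allow=()), callback='parse_item', follow=True)]

    def parse_item(self, response):
        for title in response.xpath('//li[@class="posts__post"]/a/text()').extract():
            item = NettutsItem()
            item["title"] = title
            yield item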

Upvotes: 7
