Scrapy Follow link

Question

I have the following CrawlSpider which I can't get to follow links on a university website. I think this is because of the precarious markup but I am not sure. I have tried to add a rule but it won't follow. How can I make this work?

It works as a single-page spider and scrapes page 1 fine, but it doesn't follow links.

Note: this is not homework, just me playing about after getting bored of scraping Dmoz. All help is appreciated.

# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from example.items import ExampleItem

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.ac.uk"]
    start_urls = (
        'http://www.example.ac.uk/courses/course-finder?query=&f.Year_of_entry|E=2015/16&f.Type|D=Undergraduate',
        ''
    )

    rules = (Rule (SgmlLinkExtractor(allow=("index\.php", ), callback="parse"),))

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@id="course_list"]')
        items = []

        for site in sites:
            item = ExampleItem()
            item['link'] = site.xpath('//h2/a/@href').extract()
            item['name'] = site.xpath('//h2/a/text()').extract()
            items.append(item)

        return items

The pagination markup on the website is as follows:

   
            
    Previous
    Go to page 1
    Go to page 2
    Go to page 3
    Go to page 4
    Go to page 5
    Go to page 6
    Go to page 7
    Go to page 8
    Go to page 9
    Go to page 10
    Next

alecxe · Accepted Answer

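The first problem is that you are passing the callback to the link extractor; it has to be set on the Rule instead. The callback should also not be named parse, because CrawlSpider uses the parse method internally to implement its own logic: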

rules = (
    Rule(LinkExtractor(allow=(r"index\.php", )), callback="parse_result"),
)

def parse_result(self, response):
    ...

You also need a separate rule that follows the pagination links (follow=True, no callback):

rules = (
    Rule(LinkExtractor(allow=(r"index\.php", )), callback="parse_result"),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagination"]'), follow=True),
)
