Reputation: 2739
I'm very new to Scrapy and I've been trying to scrape http://www.icbse.com/schools/state/maharashtra, but I've run into a problem. Although the page reports the total number of school links available, it only displays 50 at a time, in no particular order.
Moreover, reloading the page shows a fresh batch of 50 school links: some differ from the links shown before the refresh, while some stay the same.
What I wanted to do was add the links to a set(), and once len(set) reached the total number of schools, send that set to a parse function.
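In plain Python, the idea looks roughly like this (fetch_links and ALL_LINKS are made-up stand-ins for one page load and the full listing, just to illustrate the accumulation I'm after):

import random

# Hypothetical stand-ins: ALL_LINKS is the full listing, and
# fetch_links() simulates one page load returning 50 random links.
ALL_LINKS = ["/schools/%d" % i for i in range(200)]

def fetch_links():
    return random.sample(ALL_LINKS, 50)

captured = set()
while len(captured) != len(ALL_LINKS):  # keep "reloading" until every link is seen
    captured.update(fetch_links())

print(len(captured))  # 200 -- all links collected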
I don't understand two things about getting around this problem: how to get the page to actually reload (re-request the same URL), and where to place a set that would preserve the links and not get re-created every time parse() is called. Here's what my current code looks like:
import scrapy
import re
from icbse.items import IcbseItem


class IcbseSpider(scrapy.Spider):
    name = "icbse"
    allowed_domains = ["www.icbse.com"]
    start_urls = [
        "http://www.icbse.com/schools/",
    ]

    def parse(self, response):
        for i in xrange(20):  # I thought if I iterate the start URL,
                              # I could probably have the page reload.
                              # It didn't work though.
            for href in response.xpath(
                    '//div[@class="row"]/div[3]'
                    '//span[@class="list-group-item"]/a/@href').extract():
                url = response.urljoin(href)
                yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # total number of schools found on page
        pages = response.xpath(
            "//div[@class='container']/strong/text()").extract()[0]
        self.captured_schools_set = set()  # Placing the Set here doesn't work!
        while len(self.captured_schools_set) != int(pages):
            yield scrapy.Request(response.url, callback=self.reload_url)
        for school in self.captured_schools_set:
            yield scrapy.Request(school, callback=self.scrape_school_info)

    def reload_url(self, response):
        for school_href in response.xpath(
                "//h4[@class='school_name']/a/@href").extract():
            self.captured_schools_set.add(response.urljoin(school_href))

    def scrape_school_info(self, response):
        item = IcbseItem()
        try:
            item["School_Name"] = response.xpath(
                '//td[@class="tfield"]/strong/text()').extract()[0]
        except:
            item["School_Name"] = ''
        try:
            item["streetAddress"] = response.xpath(
                '//td[@class="tfield"]')[1].xpath(
                "//span[@itemprop='streetAddress']/text()").extract()[0]
        except:
            item["streetAddress"] = ''
        yield item
Upvotes: 2
Views: 2131
Reputation: 5529
You are iterating over an empty set:
self.captured_schools_set = set()  # Placing the Set here doesn't work!
while len(self.captured_schools_set) != int(pages):
    yield scrapy.Request(response.url, callback=self.reload_url)
for school in self.captured_schools_set:
    yield scrapy.Request(school, callback=self.scrape_school_info)
So the school requests are never fired: yielding a Request only schedules it, and the reload_url callback hasn't run yet when the for loop executes, so the set is still empty.
To reload, you should fire the http://www.icbse.com/schools/ request with the dont_filter=True argument, because with the default settings Scrapy filters out duplicate requests.
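For example, a reload request would look like the following sketch; without dont_filter=True the scheduler silently drops the repeated request as a duplicate:

# Re-request the same page; dont_filter=True bypasses the
# duplicate-request filter, so the "reload" actually happens.
yield scrapy.Request(response.url,
                     callback=self.reload_url,
                     dont_filter=True)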
But it appears that you are not firing http://www.icbse.com/schools/ requests at all, only "/schools/state/name" requests (e.g. http://www.icbse.com/schools/state/andaman-nicobar): in the yield scrapy.Request(response.url, ...) line of the snippet above you are firing response.url, and that is the problem; change it to the /schools/ URL.
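Putting it together, an untested sketch of the fixed flow could look like this (the XPaths are copied from your code, the spider starts straight from the state listing page, and the set lives on the spider instance so it persists across callbacks):

import scrapy


class SchoolsSpider(scrapy.Spider):
    name = "icbse_sketch"
    allowed_domains = ["www.icbse.com"]
    start_urls = ["http://www.icbse.com/schools/state/maharashtra"]

    def __init__(self, *args, **kwargs):
        super(SchoolsSpider, self).__init__(*args, **kwargs)
        # Created once per spider, so it survives across callbacks
        self.captured_schools_set = set()
        self.total_schools = None

    def parse(self, response):
        if self.total_schools is None:
            # Total number of schools reported on the page (your XPath)
            self.total_schools = int(response.xpath(
                "//div[@class='container']/strong/text()").extract()[0])
        for href in response.xpath(
                "//h4[@class='school_name']/a/@href").extract():
            self.captured_schools_set.add(response.urljoin(href))
        if len(self.captured_schools_set) < self.total_schools:
            # Not all schools seen yet: reload the same listing page.
            # dont_filter=True is essential here.
            yield scrapy.Request(response.url, callback=self.parse,
                                 dont_filter=True)
        else:
            for school in self.captured_schools_set:
                yield scrapy.Request(school, callback=self.parse_school)

    def parse_school(self, response):
        # Scrape the individual school page here, as in your
        # scrape_school_info method.
        pass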
Upvotes: 2