Reputation: 2739
I'm very new to Scrapy and I've been trying to scrape http://www.icbse.com/schools/state/maharashtra, but I've run into a problem. Although the page reports the total number of school links available, it only displays 50 at a time, in no particular order.
Moreover, reloading the page shows a fresh batch of 50 school links: some differ from the links shown before the refresh, while some stay the same.
What I wanted to do was add the links to a set(), and once len(set) reached the total number of schools, send that set to a parse function.
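In plain Python, the idea looks roughly like this (fetch_links and ALL_LINKS are made-up stand-ins for one page load and the full listing, just to illustrate the accumulation I'm after):

import random

# Hypothetical stand-ins: ALL_LINKS is the full listing, and
# fetch_links() simulates one page load returning 50 random links.
ALL_LINKS = ["/schools/%d" % i for i in range(200)]

def fetch_links():
    return random.sample(ALL_LINKS, 50)

captured = set()
while len(captured) != len(ALL_LINKS):  # keep "reloading" until every link is seen
    captured.update(fetch_links())

print(len(captured))  # 200 -- all links collected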
I don't understand two things about getting around this problem: how to get the page to actually reload (re-request the same URL), and where to place a set that would preserve the links and not get re-created every time parse() is called. Here's what my current code looks like:
import scrapy
import re
from icbse.items import IcbseItem


class IcbseSpider(scrapy.Spider):
    name = "icbse"
    allowed_domains = ["www.icbse.com"]
    start_urls = [
        "http://www.icbse.com/schools/",
    ]

    def parse(self, response):
        for i in xrange(20):  # I thought if I iterate the start URL,
                              # I could probably have the page reload.
                              # It didn't work though.
            for href in response.xpath(
                    '//div[@class="row"]/div[3]'
                    '//span[@class="list-group-item"]/a/@href').extract():
                url = response.urljoin(href)
                yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # total number of schools found on page
        pages = response.xpath(
            "//div[@class='container']/strong/text()").extract()[0]
        self.captured_schools_set = set()  # Placing the Set here doesn't work!
        while len(self.captured_schools_set) != int(pages):
            yield scrapy.Request(response.url, callback=self.reload_url)
        for school in self.captured_schools_set:
            yield scrapy.Request(school, callback=self.scrape_school_info)

    def reload_url(self, response):
        for school_href in response.xpath(
                "//h4[@class='school_name']/a/@href").extract():
            self.captured_schools_set.add(response.urljoin(school_href))

    def scrape_school_info(self, response):
        item = IcbseItem()
        try:
            item["School_Name"] = response.xpath(
                '//td[@class="tfield"]/strong/text()').extract()[0]
        except:
            item["School_Name"] = ''
        try:
            item["streetAddress"] = response.xpath(
                '//td[@class="tfield"]')[1].xpath(
                "//span[@itemprop='streetAddress']/text()").extract()[0]
        except:
            item["streetAddress"] = ''
        yield item
Upvotes: 2
Views: 2131
Reputation: 5529
You are iterating over an empty set:
self.captured_schools_set = set()  # Placing the Set here doesn't work!
while len(self.captured_schools_set) != int(pages):
    yield scrapy.Request(response.url, callback=self.reload_url)
for school in self.captured_schools_set:
    yield scrapy.Request(school, callback=self.scrape_school_info)
So the school requests are never fired: yielding a Request only schedules it, and the reload_url callback hasn't run yet when the for loop executes, so the set is still empty.
To reload, you should fire the http://www.icbse.com/schools/ request with the dont_filter=True argument, because with the default settings Scrapy filters out duplicate requests.
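For example, a reload request would look like the following sketch; without dont_filter=True the scheduler silently drops the repeated request as a duplicate:

# Re-request the same page; dont_filter=True bypasses the
# duplicate-request filter, so the "reload" actually happens.
yield scrapy.Request(response.url,
                     callback=self.reload_url,
                     dont_filter=True)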
But it appears that you are not firing http://www.icbse.com/schools/ requests at all, only "/schools/state/name" requests (e.g. http://www.icbse.com/schools/state/andaman-nicobar): in the yield scrapy.Request(response.url, ...) line of the snippet above you are firing response.url, and that is the problem; change it to the /schools/ URL.
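Putting it together, an untested sketch of the fixed flow could look like this (the XPaths are copied from your code, the spider starts straight from the state listing page, and the set lives on the spider instance so it persists across callbacks):

import scrapy


class SchoolsSpider(scrapy.Spider):
    name = "icbse_sketch"
    allowed_domains = ["www.icbse.com"]
    start_urls = ["http://www.icbse.com/schools/state/maharashtra"]

    def __init__(self, *args, **kwargs):
        super(SchoolsSpider, self).__init__(*args, **kwargs)
        # Created once per spider, so it survives across callbacks
        self.captured_schools_set = set()
        self.total_schools = None

    def parse(self, response):
        if self.total_schools is None:
            # Total number of schools reported on the page (your XPath)
            self.total_schools = int(response.xpath(
                "//div[@class='container']/strong/text()").extract()[0])
        for href in response.xpath(
                "//h4[@class='school_name']/a/@href").extract():
            self.captured_schools_set.add(response.urljoin(href))
        if len(self.captured_schools_set) < self.total_schools:
            # Not all schools seen yet: reload the same listing page.
            # dont_filter=True is essential here.
            yield scrapy.Request(response.url, callback=self.parse,
                                 dont_filter=True)
        else:
            for school in self.captured_schools_set:
                yield scrapy.Request(school, callback=self.parse_school)

    def parse_school(self, response):
        # Scrape the individual school page here, as in your
        # scrape_school_info method.
        pass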
Upvotes: 2