Shantanu Bedajna

Reputation: 591

Python Scrapy scrape data from nested pages

I have made a scraper for a website where the data is nested; I mean that to get to the data page I have to click through 5 links, and only then do I reach the page where I scrape the data.

For every page 1 there are multiple page 2s, for every page 2 there are many page 3s, and so on.

So I have a parse function for opening each page until I get to the page that has the data, where I add the data to the item class and return the item.

But it is skipping a lot of links without scraping data: the last `parse_link` function is not executed after 100 or so links. How do I know `parse_link` is not executing?

Because I am printing `print '\n\n', 'I AM EXECUTED !!!!'` and it stops printing after 100 or so links, while `parse_then` still executes every time.

What I want to know is: am I doing it right? Is this the right approach for scraping a website like this?

Here is the code:

# -*- coding: utf-8 -*-
import scrapy
from urlparse import urljoin
from nothing.items import NothingItem

class Canana411Spider(scrapy.Spider):
    name = "canana411"
    allowed_domains = ["www.canada411.ca"]
    start_urls = ['http://www.canada411.ca/']

    # PAGE 1

    def parse(self, response):
        SET_SELECTOR = '.c411AlphaLinks.c411NoPrint ul li'
        for attr in response.css(SET_SELECTOR):
            linkse = 'a ::attr(href)'
            link = attr.css(linkse).extract_first()
            link = urljoin(response.url, link)

            yield scrapy.Request(link, callback=self.parse_next)

    # PAGE 2

    def parse_next(self, response):

        SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
        for attr in response.css(SET_SELECTOR):
            linkse = 'a ::attr(href)'
            link = attr.css(linkse).extract_first()
            link = urljoin(response.url, link)
            yield scrapy.Request(link, callback=self.parse_more)

    # PAGE 3

    def parse_more(self, response):

        SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
        for attr in response.css(SET_SELECTOR):
            linkse = 'a ::attr(href)'
            link = attr.css(linkse).extract_first()
            link = urljoin(response.url, link)
            yield scrapy.Request(link, callback=self.parse_other)

    # PAGE 4

    def parse_other(self, response):
        SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
        for attr in response.css(SET_SELECTOR):
            linkse = 'a ::attr(href)'
            link = attr.css(linkse).extract_first()
            link = urljoin(response.url, link)
            yield scrapy.Request(link, callback=self.parse_then)

    # PAGE 5

    def parse_then(self, response):
        SET_SELECTOR = '.c411Cities li h3 a ::attr(href)'
        link = response.css(SET_SELECTOR).extract_first()
        link = urljoin(response.url, link)
        return scrapy.Request(link, callback=self.parse_link)

    # PAGE 6 - THE DATA PAGE

    def parse_link(self, response):
        print '\n\n', 'I AM EXECUTED !!!!'
        item = NothingItem()
        namese = '.vcard__name ::text'
        addressse = '.c411Address.vcard__address ::text'
        phse = 'span.vcard__label ::text'
        item['name'] = response.css(namese).extract_first()
        item['address'] = response.css(addressse).extract_first()
        item['phone'] = response.css(phse).extract_first()
        return item

Am I doing it right, or is there a better way that I am missing?

Upvotes: 0

Views: 937

Answers (1)

Eugene Lisitsky

Reputation: 12885

If there's no conflict (e.g. the 1st page cannot contain selectors and links that belong to the 3rd, and links should be followed the same way from any page, or something alike), I'd recommend flattening the rules used to extract links. Then a single parse callback would be enough.

Upvotes: 1
