Przemysław G.

Reputation: 3

scrapy - data from the following pages

I have a problem: how do I download data from the following pages after the spider moves past the first one? It only downloads from the first page. Here is my code:

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request


class PronobelSpider(Spider):
    name = 'pronobel'
    allowed_domains = ['pronobel.pl']
    start_urls = ['http://pronobel.pl/praca-opieka-niemcy/']

    def parse(self, response):

        jobs = response.xpath('//*[@class="offer offer-immediate"]')
        for job in jobs:
            title = job.xpath('.//*[@class="offer-title"]/text()').extract_first()
            start_date = job.xpath('.//*[@class="offer-attr offer-departure"]/text()').extract_first()
            place = job.xpath('.//*[@class="offer-attr offer-localization"]/text()').extract_first()
            language = job.xpath('.//*[@class="offer-attr offer-salary"]/text()').extract()[1]

            print title
            print start_date
            print place
            print language

        next_page_url = response.xpath('//*[@class="page-nav nav-next"]/a/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url)

I only get data from the first page

Upvotes: 0

Views: 47

Answers (2)

Your problem is not about crawling the next page; your problem is your selector. First of all, when selecting an element by class, it's recommended to use CSS. What is happening is that there are no elements with the class offer-immediate on the other pages, so your XPath matches nothing there.

I made a few changes to your code; see below:

from scrapy import Spider
from scrapy.http import Request


class PronobelSpider(Spider):
    name = 'pronobel'
    allowed_domains = ['pronobel.pl']
    start_urls = ['http://pronobel.pl/praca-opieka-niemcy/']

    def parse(self, response):
        jobs = response.css('div.offers-list div.offer')
        for job in jobs:
            title = job.css('a.offer-title::text').extract_first()
            start_date = job.css('div.offer-attr.offer-departure::text').extract_first()
            place = job.css('div.offer-attr.offer-localization::text').extract_first()
            language = job.css('div.offer-attr.offer-salary::text').extract()[1]
            yield {'title': title,
                   'start_date': start_date,
                   'place': place,
                   'language': language,
                   'url': response.url}

        next_page_url = response.css('li.page-nav.nav-next a::attr(href)').extract_first()
        if next_page_url:  # on the last page there is no next link
            yield Request(response.urljoin(next_page_url))

Upvotes: 1

Przemysław G.

Reputation: 3

I also tried this:

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request


class PronobelSpider(Spider):
    name = 'pronobel'
    allowed_domains = ['pronobel.pl']
    start_urls = ['http://pronobel.pl/praca-opieka-niemcy']

    def parse(self, response):

        jobs = response.xpath('//*[@class="offer offer-immediate"]')
        for job in jobs:
            title = job.xpath('.//*[@class="offer-title"]/text()').extract_first()
            start_date = job.xpath('.//*[@class="offer-attr offer-departure"]/text()').extract_first()
            place = job.xpath('.//*[@class="offer-attr offer-localization"]/text()').extract_first()
            language = job.xpath('.//*[@class="offer-attr offer-salary"]/text()').extract()[1]

            yield {'place': place}

        next_page_url = response.xpath('//*[@class="page-nav nav-next"]/a/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url)

The resulting log output:

2019-03-20 17:58:28 [scrapy.core.engine] INFO: Spider opened
2019-03-20 17:58:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-20 17:58:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6025
2019-03-20 17:58:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://pronobel.pl/praca-opieka-niemcy> from <GET http://pronobel.pl/praca-opieka-niemcy>
2019-03-20 17:58:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pronobel.pl/praca-opieka-niemcy> (referer: None)
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Ratingen'}
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Burg Stargard'}
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Fahrenzhausen'}
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Meerbusch'}
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Geislingen an der Steige T\xfcrkheim/Deutschland'}
2019-03-20 17:58:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pronobel.pl/praca-opieka-niemcy?page_nr=2> (referer: https://pronobel.pl/praca-opieka-niemcy)
2019-03-20 17:58:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pronobel.pl/praca-opieka-niemcy?page_nr=3> (referer: https://pronobel.pl/praca-opieka-niemcy?page_nr=2)
2019-03-20 17:58:29 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://pronobel.pl/praca-opieka-niemcy?page_nr=3> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-03-20 17:58:29 [scrapy.core.engine] INFO: Closing spider (finished)
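The ?page_nr=... URLs in this log come from response.urljoin, which resolves the (possibly relative) next-page href against the current page URL. Assuming the site's next link is a relative href such as ?page_nr=2 (the actual markup is not shown here), the same resolution can be reproduced with the standard library:

```python
from urllib.parse import urljoin

base = "https://pronobel.pl/praca-opieka-niemcy"

# A query-only relative href keeps the path of the base URL and
# replaces only the query string (RFC 3986 resolution).
next_url = urljoin(base, "?page_nr=2")
print(next_url)  # https://pronobel.pl/praca-opieka-niemcy?page_nr=2
```

The "Filtered duplicate request" line in the log is Scrapy's built-in duplicate filter dropping a repeat of an already-seen URL; once no new requests remain, the spider closes. Note that pages 2 and 3 were crawled but produced no "Scraped from" lines, which matches the first answer's diagnosis: the offer-immediate XPath matches nothing on those pages.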

Upvotes: 0
