Przemysław G.

Reputation: 3

scrapy - data from the following pages

I have a problem: how do I download data from the following pages after the spider moves past the first one? It only downloads from the first page. Here is my code:

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request


class PronobelSpider(Spider):
    name = 'pronobel'
    allowed_domains = ['pronobel.pl']
    start_urls = ['http://pronobel.pl/praca-opieka-niemcy/']

    def parse(self, response):

        jobs = response.xpath('//*[@class="offer offer-immediate"]')
        for job in jobs:
            title = job.xpath('.//*[@class="offer-title"]/text()').extract_first()
            start_date = job.xpath('.//*[@class="offer-attr offer-departure"]/text()').extract_first()
            place = job.xpath('.//*[@class="offer-attr offer-localization"]/text()').extract_first()
            language = job.xpath('.//*[@class="offer-attr offer-salary"]/text()').extract()[1]

            print title
            print start_date
            print place
            print language

        next_page_url = response.xpath('//*[@class="page-nav nav-next"]/a/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url)

I only get data from the first page

Upvotes: 0

Views: 47

Answers (2)

Your problem is not about crawling the next page; your problem is your selector. First of all, when selecting an element by class, it's recommended to use CSS. What is happening is that there are no elements with the class offer-immediate on the other pages, so your XPath matches nothing there.

I made a few changes to your code; see below:

from scrapy import Spider
from scrapy.http import Request


class PronobelSpider(Spider):
    name = 'pronobel'
    allowed_domains = ['pronobel.pl']
    start_urls = ['http://pronobel.pl/praca-opieka-niemcy/']

    def parse(self, response):
        jobs = response.css('div.offers-list div.offer')
        for job in jobs:
            title = job.css('a.offer-title::text').extract_first()
            start_date = job.css('div.offer-attr.offer-departure::text').extract_first()
            place = job.css('div.offer-attr.offer-localization::text').extract_first()
            language = job.css('div.offer-attr.offer-salary::text').extract()[1]
            yield {'title': title,
                   'start_date': start_date,
                   'place': place,
                   'language': language,
                   'url': response.url}

        next_page_url = response.css('li.page-nav.nav-next a::attr(href)').extract_first()
        if next_page_url:  # on the last page there is no next link
            yield Request(response.urljoin(next_page_url))

Upvotes: 1

Przemysław G.

Reputation: 3

I also tried this:

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request


class PronobelSpider(Spider):
    name = 'pronobel'
    allowed_domains = ['pronobel.pl']
    start_urls = ['http://pronobel.pl/praca-opieka-niemcy']

    def parse(self, response):

        jobs = response.xpath('//*[@class="offer offer-immediate"]')
        for job in jobs:
            title = job.xpath('.//*[@class="offer-title"]/text()').extract_first()
            start_date = job.xpath('.//*[@class="offer-attr offer-departure"]/text()').extract_first()
            place = job.xpath('.//*[@class="offer-attr offer-localization"]/text()').extract_first()
            language = job.xpath('.//*[@class="offer-attr offer-salary"]/text()').extract()[1]

            yield {'place': place}

        next_page_url = response.xpath('//*[@class="page-nav nav-next"]/a/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url)

The resulting log output:

2019-03-20 17:58:28 [scrapy.core.engine] INFO: Spider opened
2019-03-20 17:58:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-20 17:58:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6025
2019-03-20 17:58:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://pronobel.pl/praca-opieka-niemcy> from <GET http://pronobel.pl/praca-opieka-niemcy>
2019-03-20 17:58:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pronobel.pl/praca-opieka-niemcy> (referer: None)
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Ratingen'}
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Burg Stargard'}
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Fahrenzhausen'}
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Meerbusch'}
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Geislingen an der Steige T\xfcrkheim/Deutschland'}
2019-03-20 17:58:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pronobel.pl/praca-opieka-niemcy?page_nr=2> (referer: https://pronobel.pl/praca-opieka-niemcy)
2019-03-20 17:58:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pronobel.pl/praca-opieka-niemcy?page_nr=3> (referer: https://pronobel.pl/praca-opieka-niemcy?page_nr=2)
2019-03-20 17:58:29 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://pronobel.pl/praca-opieka-niemcy?page_nr=3> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-03-20 17:58:29 [scrapy.core.engine] INFO: Closing spider (finished)
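The ?page_nr=... URLs in this log come from response.urljoin, which resolves the (possibly relative) next-page href against the current page URL. Assuming the site's next link is a relative href such as ?page_nr=2 (the actual markup is not shown here), the same resolution can be reproduced with the standard library:

```python
from urllib.parse import urljoin

base = "https://pronobel.pl/praca-opieka-niemcy"

# A query-only relative href keeps the path of the base URL and
# replaces only the query string (RFC 3986 resolution).
next_url = urljoin(base, "?page_nr=2")
print(next_url)  # https://pronobel.pl/praca-opieka-niemcy?page_nr=2
```

The "Filtered duplicate request" line in the log is Scrapy's built-in duplicate filter dropping a repeat of an already-seen URL; once no new requests remain, the spider closes. Note that pages 2 and 3 were crawled but produced no "Scraped from" lines, which matches the first answer's diagnosis: the offer-immediate XPath matches nothing on those pages.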

Upvotes: 0
