konsalex

Reputation: 469

Scrapy Pagination Fails

Hello, this is my first ever post.

I am trying to write a web spider that follows the pagination links on invia.cz and copies all the hotel titles.

import scrapy


y=0
class invia(scrapy.Spider):
    name = 'Kreta'
    start_urls = ['https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=1']

    def parse(self, response):

        for x in range (1, 9):
            yield {
             'titles':response.css("#main > div > div > div > div.col.col-content > div.product-list > div > ul > li:nth-child(%d)>div.head>h2>a>span.name::text"%(x)).extract() ,
             }

        if (response.css('#main > div > div > div > div.col.col-content >   
                            div.product-list > div > p > 
                            a.next').extract_first()):
         y=y+1
         go = ["https://dovolena.invia.cz/d_start_from=13.01.2017&sort=nl_sell&page=%d" % y] 
         print go
         yield scrapy.Request(
                response.urljoin(go),
                callback=self.parse
         )

On this website the pages load via AJAX, so I build the URL manually, incrementing the page number by one only if the next button appears on the page.

In the scrapy shell, when I test whether the button appears and check the conditions, everything looks good, but when I start the spider it only crawls the first page.

It's my first spider ever, so thanks in advance.

Also, here is the error log: Error Log1, Error Log

Upvotes: 1

Views: 209

Answers (1)

Granitosaurus

Reputation: 21436

Your use of the "global" y variable is not only peculiar, it won't work either.
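
A minimal sketch of why the module-level counter fails, in plain Python 3 with no Scrapy involved. The function names `bump_broken` and `bump_declared` are made up for illustration. Because `y` is assigned inside the function, Python treats it as a local name, so reading it before that assignment raises `UnboundLocalError`:

```python
y = 0  # module-level counter, as in the question


def bump_broken():
    # The assignment below makes `y` local to this function, so the read
    # on the right-hand side happens before any local value exists.
    y = y + 1
    return y


def bump_declared():
    # Declaring `global y` makes the assignment target the module-level name.
    global y
    y = y + 1
    return y


try:
    bump_broken()
except UnboundLocalError as exc:
    print("broken:", type(exc).__name__)

print("declared:", bump_declared())
```

Declaring `global y` would make the code run, but the counter then becomes state shared by every callback, which is one reason to carry it with the request instead.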

You're using y to count how many times parse has been called. Ideally a callback shouldn't touch anything outside its own scope, and you can get the same result by passing the counter along in the request's meta attribute:

def parse(self, response):
    # the index travels with the request; the first response has none, so default to page 1
    y = response.meta.get('index', 1)
    y += 1
    # ...
    # next page
    url = 'http://example.com/?p={}'.format(y)
    yield scrapy.Request(url, self.parse, meta={'index': y})

Regarding the pagination issue: your next-page CSS selector is wrong, because the <a> node you're selecting doesn't have an absolute href attached to it. Fixing this also makes the y counter unnecessary. To solve it, try:

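def parse(self, response):
    # needs `import re` and `import scrapy` at the top of the module
    next_page = response.css("a.next::attr(data-page)").extract_first()
    if next_page:  # extract_first() returns None when there is no "next" link
        # replace "page=1" part of the url with next number
        url = re.sub(r'page=\d+', 'page=' + next_page, response.url)
        yield scrapy.Request(url, self.parse)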
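
The URL rewrite can be checked on its own, without running a crawl. The sketch below assumes the current URL already contains a `page=<number>` parameter, as the start URL in the question does. `next_page_url` is a hypothetical helper name, and the value `'2'` stands in for whatever `data-page` actually holds.

```python
import re


def next_page_url(current_url, next_page):
    """Return current_url with its page=<n> parameter set to next_page.

    Returns None when next_page is None (no "next" link was found).
    """
    if next_page is None:
        return None
    # count=1: only the first page=<n> occurrence is rewritten
    return re.sub(r'page=\d+', 'page=' + str(next_page), current_url, count=1)


start = 'https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=1'
print(next_page_url(start, '2'))
print(next_page_url(start, None))
```

Returning None for a missing link lets the caller stop paginating instead of failing on string concatenation.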

EDIT: Here's the whole working spider:

import scrapy
import re


class InviaSpider(scrapy.Spider):
    name = 'invia'
    start_urls = ['https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=1']

    def parse(self, response):
        # hotel titles are the text of the span.name nodes
        names = response.css('span.name::text').extract()
        for name in names:
            yield {'name': name}

        # next page: the "next" link carries the page number in data-page
        next_page = response.css("a.next::attr(data-page)").extract_first()
        if next_page:  # stop when there is no "next" link
            url = re.sub(r'page=\d+', 'page=' + next_page, response.url)
            yield scrapy.Request(url, self.parse)
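
As an alternative to the regex, the page number could be set by parsing the query string. This is only a sketch under assumptions: it targets Python 3's `urllib.parse`, the helper name `with_page` is invented here, and it is not part of the spider above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return url with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    # keep every parameter except an existing `page`, preserving order
    params = [(key, value)
              for key, value in parse_qsl(parts.query, keep_blank_values=True)
              if key != 'page']
    params.append(('page', str(page)))
    return urlunsplit(parts._replace(query=urlencode(params)))


start = 'https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=1'
print(with_page(start, 2))
```

Unlike the regex, this also adds `page` when the URL does not have one yet, at the cost of re-encoding the query string.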

Upvotes: 1
