Reputation:
I'm trying to make a program that retrieves the title and price of items while moving on to the following pages. All the information on the first page (title, price) is extracted, but the program does not go to the next page.
URL : https://scrapingclub.com/exercise/list_basic/
```
import scrapy


class RecursiveSpider(scrapy.Spider):
    name = 'recursive'
    allowed_domains = ['scrapingclub.com/exercise/list_basic/']
    start_urls = ['http://scrapingclub.com/exercise/list_basic//']

    def parse(self, response):
        card = response.xpath("//div[@class='card-body']")
        for thing in card:
            title = thing.xpath(".//h4[@class='card-title']").extract_first()
            price = thing.xpath(".//h5").extract_first()
            yield {'price': price, 'title': title}

            next_page_url = response.xpath("//li[@class='page-item']//a/@href")
            if next_page_url:
                absolute_nextpage_url = response.urljoin(next_page_url)
                yield scrapy.Request(absolute_nextpage_url)
```
Upvotes: 1
Views: 138
Reputation: 2564
You should add the execution logs in situations like this; they would help pinpoint your problem.
I can see a few problems, though:
```
next_page_url = response.xpath("//li[@class='page-item']//a/@href")
if next_page_url:
    absolute_nextpage_url = response.urljoin(next_page_url)
```
The variable next_page_url contains a selector, not a string. You need to use the .get() method to extract the string with the relative URL.
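For illustration, the difference is roughly this (the '?page=2' value is inferred from the log line below):
```
next_page_url = response.xpath("//li[@class='page-item']//a/@href")        # SelectorList object
next_page_url = response.xpath("//li[@class='page-item']//a/@href").get()  # str such as '?page=2', or None
```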
After fixing this, I executed your code and it returned:
```
2020-09-04 15:19:34 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'scrapingclub.com': <GET https://scrapingclub.com/exercise/list_basic/?page=2>
```
It's filtering the request as it considers it an offsite request, even though it isn't. To fix it, just use allowed_domains = ['scrapingclub.com'] or remove this line entirely. If you want to understand more about how this filter works, check the source here.
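As a quick before/after (the offsite middleware compares the entries against the request's hostname, so an entry containing a path can never match):
```
allowed_domains = ['scrapingclub.com/exercise/list_basic/']  # never matches a hostname -> request filtered
allowed_domains = ['scrapingclub.com']                       # matches, the request goes through
```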
Finally, it doesn't make sense to have this snippet under the for loop:
```
next_page_url = response.xpath("//li[@class='page-item']//a/@href").get()  # I added the .get()
if next_page_url:
    absolute_nextpage_url = response.urljoin(next_page_url)
    yield scrapy.Request(absolute_nextpage_url)
```
- If you use the .get() method, it will return to next_page_url the first item (which is page 2 now, but in the next callback it will be page 1, so you will never advance to page 3).
- If you use .getall(), it will return a list, which you would need to iterate over yielding all possible requests; but this is a recursive function, so you would end up doing that in each recursion step.

The best option is to select the next button instead of the page number:
```
next_page_url = response.xpath('//li[@class="page-item"]/a[contains(text(), "Next")]/@href').get()
```
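Putting these fixes together, a minimal corrected version of the spider could look like this (the normalize-space() XPath calls are my assumption for extracting clean text from the title and price elements; adjust them to whatever fields you need):
```
import scrapy


class RecursiveSpider(scrapy.Spider):
    name = 'recursive'
    allowed_domains = ['scrapingclub.com']  # domain only, so the offsite filter passes
    start_urls = ['https://scrapingclub.com/exercise/list_basic/']

    def parse(self, response):
        for card in response.xpath("//div[@class='card-body']"):
            yield {
                'title': card.xpath("normalize-space(.//h4[@class='card-title'])").get(),
                'price': card.xpath("normalize-space(.//h5)").get(),
            }

        # Outside the loop: follow the "Next" button rather than the first page link
        next_page_url = response.xpath('//li[@class="page-item"]/a[contains(text(), "Next")]/@href').get()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url))
```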
Upvotes: 1