Reputation:
I'm trying to make a program that retrieves the title and price of items while moving on to the following pages. All the information on the first page (title, price) is extracted, but the program does not go to the next page.
URL : https://scrapingclub.com/exercise/list_basic/
```
import scrapy


class RecursiveSpider(scrapy.Spider):
    name = 'recursive'
    allowed_domains = ['scrapingclub.com/exercise/list_basic/']
    start_urls = ['http://scrapingclub.com/exercise/list_basic//']

    def parse(self, response):
        card = response.xpath("//div[@class='card-body']")
        for thing in card:
            title = thing.xpath(".//h4[@class='card-title']").extract_first()
            price = thing.xpath(".//h5").extract_first()
            yield {'price': price, 'title': title}

            next_page_url = response.xpath("//li[@class='page-item']//a/@href")
            if next_page_url:
                absolute_nextpage_url = response.urljoin(next_page_url)
                yield scrapy.Request(absolute_nextpage_url)
```
Upvotes: 1
Views: 138
Reputation: 2564
You should add the execution logs in situations like this; they would help pinpoint your problem.
I can see a few problems, though:
```
next_page_url = response.xpath("//li[@class='page-item']//a/@href")
if next_page_url:
    absolute_nextpage_url = response.urljoin(next_page_url)
```
The variable next_page_url contains a selector, not a string. You need to use the .get() method to extract the string with the relative URL.
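For illustration, the difference is roughly this (the '?page=2' value is inferred from the log line below):
```
next_page_url = response.xpath("//li[@class='page-item']//a/@href")        # SelectorList object
next_page_url = response.xpath("//li[@class='page-item']//a/@href").get()  # str such as '?page=2', or None
```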
After fixing this, I executed your code and it returned:
```
2020-09-04 15:19:34 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'scrapingclub.com': <GET https://scrapingclub.com/exercise/list_basic/?page=2>
```
It's filtering the request as it considers it an offsite request, even though it isn't. To fix it, just use allowed_domains = ['scrapingclub.com'] or remove this line entirely. If you want to understand more about how this filter works, check the source here.
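As a quick before/after (the offsite middleware compares the entries against the request's hostname, so an entry containing a path can never match):
```
allowed_domains = ['scrapingclub.com/exercise/list_basic/']  # never matches a hostname -> request filtered
allowed_domains = ['scrapingclub.com']                       # matches, the request goes through
```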
Finally, it doesn't make sense to have this snippet under the for loop:
```
next_page_url = response.xpath("//li[@class='page-item']//a/@href").get()  # I added the .get()
if next_page_url:
    absolute_nextpage_url = response.urljoin(next_page_url)
    yield scrapy.Request(absolute_nextpage_url)
```
- If you use the .get() method, it will return to next_page_url the first item (which is page 2 now, but in the next callback it will be page 1, so you will never advance to page 3).
- If you use .getall(), it will return a list, which you would need to iterate over yielding all possible requests; but this is a recursive function, so you would end up doing that in each recursion step.

The best option is to select the next button instead of the page number:
```
next_page_url = response.xpath('//li[@class="page-item"]/a[contains(text(), "Next")]/@href').get()
```
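Putting these fixes together, a minimal corrected version of the spider could look like this (the normalize-space() XPath calls are my assumption for extracting clean text from the title and price elements; adjust them to whatever fields you need):
```
import scrapy


class RecursiveSpider(scrapy.Spider):
    name = 'recursive'
    allowed_domains = ['scrapingclub.com']  # domain only, so the offsite filter passes
    start_urls = ['https://scrapingclub.com/exercise/list_basic/']

    def parse(self, response):
        for card in response.xpath("//div[@class='card-body']"):
            yield {
                'title': card.xpath("normalize-space(.//h4[@class='card-title'])").get(),
                'price': card.xpath("normalize-space(.//h5)").get(),
            }

        # Outside the loop: follow the "Next" button rather than the first page link
        next_page_url = response.xpath('//li[@class="page-item"]/a[contains(text(), "Next")]/@href').get()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url))
```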
Upvotes: 1