sophocles

Reputation: 13821

Dealing with pagination when using scrapy-selenium (POST request)

I am trying to scrape the following website:

https://www.getwines.com/main.asp?request=search&type=w&s1=s9818865857&fbclid=IwAR3yF9x1X7sdPYgsfl4vF1oNF7GNoF1pSov4lwJLEeeTYFGevBTfRKOPBmo

I am successful in scraping the first page, but I have trouble going to the next pages. There are two reasons for this:

  1. When inspecting the next-page button, I don't find a relative or absolute URL. Instead, the href is javascript:getPage(2), which I can't use to follow links.

  2. On the first page, the next-page link can be reached with (//table[@class='tbl_pagination']//a//@href)[11], but from the 2nd page onwards it is the 12th item, i.e. (//table[@class='tbl_pagination']//a//@href)[12]
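A positional index like [11] breaks whenever the set of pagination links changes between pages. One way around this is to select the next-page anchor by its '>>' label instead of its position. Below is a minimal stdlib-only sketch; the HTML snippet is a made-up stand-in for the site's real pagination table, not copied from it:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the site's pagination table (assumed structure).
html = """<table class="tbl_pagination">
  <tr>
    <td><a href="javascript:getPage(1)"><b>1</b></a></td>
    <td><a href="javascript:getPage(3)"><b>&gt;&gt;</b></a></td>
  </tr>
</table>"""

root = ET.fromstring(html)
# Match the anchor whose <b> child reads '>>' instead of counting positions,
# so the same expression works on every page.
link = root.find(".//a[b='>>']")
print(link.get("href"))  # javascript:getPage(3)
```

The equivalent Scrapy XPath would be //table[@class='tbl_pagination']//a[b[text()='>>']]/@href, which does not depend on how many page links precede the button.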

So ultimately my question is: how do I effectively go to ALL the subsequent pages and scrape the data?

This is probably very simple to solve, but I am a beginner in web scraping so any feedback is appreciated. Please see below my code.

Thanks for your help.

import scrapy
from scrapy_selenium import SeleniumRequest

class WinesSpider(scrapy.Spider):
    name = 'wines'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.getwines.com/category_Wine',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        products = response.xpath("(//div[@class='layMain']//tbody)[5]/tr")
        for product in products:
            yield {
                'product_name':
                product.xpath(".//a[@class='Srch-producttitle']/text()").get(),
                'product_link':
                product.xpath(".//a[@class='Srch-producttitle']/@href").get(),
                'product_actual_price':
                product.xpath(".//td//td[3]//td/span[2]/text()").get(),
                'product_price_onsale':
                product.xpath(".//td//td[3]//td/span[4]/text()").get()
            }
        # next_page = response.xpath("(//table[@class='tbl_pagination']//a//@href)[11]").get()
        # if next_page:
        #     absolute_url = f"https://www.getwines.com/category_Wine"

Upvotes: 2

Views: 738

Answers (1)

sophocles

Reputation: 13821

Please see below the code that answers the above question.

In a nutshell, I changed the structure of the code and it now works perfectly. Some remarks:

  1. First, save the content of every page in a list.
  2. It is important to catch NoSuchElementException at the end of the while/try loop. Before adding this, the code kept failing because it did not know what to do once the last page was reached.
  3. Then parse the stored page sources (responses) to extract the data.

All in all, I think this structure works well when integrating Selenium with Scrapy. However, as I am a beginner with web scraping, any additional feedback on integrating the two more efficiently is appreciated.

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
from selenium.common.exceptions import NoSuchElementException

class WinesSpider(scrapy.Spider):
    name = 'wines'

    responses = []

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.getwines.com/category_Wine',
            callback=self.parse
        )

    def parse(self, response):
        driver = response.meta['driver']
        # Store the first page's source before clicking through the pagination.
        initial_page = driver.page_source
        self.responses.append(initial_page)
        while True:
            try:
                # The '>>' button carries a javascript:getPage(n) href;
                # executing it loads the next page in the same driver.
                next_page = driver.find_element_by_xpath("//b[text()= '>>']/parent::a")
                href = next_page.get_attribute('href')
                driver.execute_script(href)
                driver.implicitly_wait(2)
                self.responses.append(driver.page_source)
            except NoSuchElementException:
                # No '>>' button means the last page has been reached.
                break

        for resp in self.responses:
            r = Selector(text=resp)
            products = r.xpath("(//div[@class='layMain']//tbody)[5]/tr")
            for product in products:
                yield {
                    'product_name':
                    product.xpath(".//a[@class='Srch-producttitle']/text()").get(),
                    'product_link':
                    product.xpath(".//a[@class='Srch-producttitle']/@href").get(),
                    'product_actual_price':
                    product.xpath(".//span[@class='RegularPrice']/text()").get(),
                    'product_price_onsale':
                    product.xpath(".//td//td[3]//td/span[4]/text()").get()
                }
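As a side note on the javascript:getPage(n) hrefs: since they are not real URLs, another option (a hypothetical variation, not part of the answer above) is to pull the page number out of the href and invoke the page function explicitly:

```python
import re

# A pagination href as found on the site (see the question).
href = "javascript:getPage(2)"

# Extract the target page number from the javascript: pseudo-URL.
match = re.fullmatch(r"javascript:getPage\((\d+)\)", href)
page_number = int(match.group(1))
print(page_number)  # 2

# With a live driver, the same call could then be made explicitly,
# equivalent to driver.execute_script(href) in the code above:
# driver.execute_script(f"getPage({page_number})")
```

Keeping the page number as an integer also makes it easy to log progress or stop after a fixed number of pages.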

Upvotes: 1
