Popskully

Reputation: 27

Python - Is it possible for scrapy to go into each product pages and scrape the data?

I am new to Python and web scraping, and I am wondering whether it is possible to scrape product pages with Scrapy.

Example: when I search for monitors on amazon.com, I would like Scrapy to go to each product page and scrape the data from there, instead of just scraping the data from the search results page.

I read something about XPath, but I am not sure whether it is possible with that, and most other resources I found seem to do the scraping with other tools like Beautiful Soup. I currently have a Scrapy project which scrapes a search results page, but I would like to improve it to scrape from the product pages.

Edit:

Here's my modified spider.py based on your suggestions:

import scrapy
from scrapy import Request

from ..items import scrapeItem  # assuming the default Scrapy project layout


class TestSpiderSpider(scrapy.Spider):
    name = 'testscraper'
    page_number = 2
    start_urls = ['https://jamaicaclassifiedonline.com/auto/cars/']

    def parse(self, response):
        for car in response.css('.col.l3.s12.m6'):
            # Scope the selectors to the current card, not the whole page
            product_title = car.css('.jco-card-title::text').get()
            # The link is in the href attribute, not the element text
            product_link = car.css('a.tooltipped::attr(href)').get()
            url = response.urljoin(product_link)
            yield Request(url, cb_kwargs={'product_title': product_title}, callback=self.parse_car)

    def parse_car(self, response, product_title):
        # Build the item here, where the product page response is available
        items = scrapeItem()
        product_description = response.css('.wysiwyg::text').get()  # needs its own Field in items.py to be stored
        product_imagelink = response.css('.responsive-img img::attr(data-src)').getall()
        items['product_title'] = product_title
        items['product_imagelink'] = product_imagelink
        yield items

Here's the code for items.py:

import scrapy


class scrapeItem(scrapy.Item):
    product_title = scrapy.Field()
    product_imagelink = scrapy.Field()

There is currently an error when I try to run it. It seems to be related to the

yield Request

Hopefully I am on the right track.

I also added the loop to the code.

Upvotes: 0

Views: 1301

Answers (1)

renatodvc

Reputation: 2564

This type of question is better answered with a case in point, where you provide your code and explain what you have already tried to do.

In a general way, here is how you do it:

  • Request the search page (you mention you already did that)
  • Select the results you want; for that you can use either XPath or CSS selectors (read more on selectors)
  • Extract the href attribute (that is, the URL) of the items whose product page you want to request (this can be done with the selectors)
  • Yield a new request to the product page. If there is data you need to pass along, you can use cb_kwargs (recommended) or meta (there is also a good explanation here)
  • When Scrapy gets a response for your new request, it will call the parsing function (determined by the callback argument)
  • In this parsing function, use selectors to scrape the data that interests you, then build and yield your items
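The URL-building step above can be sketched in isolation: Scrapy's response.urljoin() resolves a relative href against the page URL, with the same joining behaviour as the standard library's urljoin (the base URL below is just an illustration):

```python
from urllib.parse import urljoin

# A relative href is resolved against the page's URL;
# an absolute href passes through unchanged.
base = "https://www.example.com/search?q=monitors"
print(urljoin(base, "/products/123"))  # https://www.example.com/products/123
print(urljoin(base, "detail/456"))     # https://www.example.com/detail/456
```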

To make it more clear, here is a very broad example (it doesn't actually run; it's meant to illustrate):

from scrapy import Request, Spider


class ExampleSpider(Spider):
    name = "example"
    start_urls = ['https://www.example.com']

    def parse(self, response):
        products = response.xpath('//div[@class="products"]')
        for product in products:
            product_name = product.xpath('a/text()').get()
            href = product.xpath('a/@href').get()
            url = response.urljoin(href) # This builds a full URL when href is a relative url
            yield Request(url, cb_kwargs={'product_name': product_name}, callback=self.parse_product)

    def parse_product(self, response, product_name): # Notice it will receive a new arg here, as passed in cb_kwargs
        description = response.xpath('//article[@id="desc"]//text()').getall()
        price = response.xpath('//div[@id="price"]/text()').get()
        yield {
            'product_name': product_name,
            'price': price,
            'description': description
        }
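As a side note (not part of the example above), getall() returns a list of text fragments, so a common follow-up is to normalise the description into one string before yielding. A minimal sketch:

```python
def clean_text(fragments):
    """Join text fragments from Selector.getall(), dropping layout whitespace."""
    return " ".join(part.strip() for part in fragments if part.strip())

# Typical output of a //text() query: text nodes interleaved with whitespace
fragments = ["  24-inch monitor\n", "   ", "1080p IPS panel "]
print(clean_text(fragments))  # 24-inch monitor 1080p IPS panel
```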

Upvotes: 4
