Adam

Reputation: 1285

Scraping multiple pages using scrapy

I am using scrapy to scrape this website: https://www.cartrade.com/buy-used-cars/p-10

My code:

import scrapy

class ShopcluesSpider(scrapy.Spider):
   name = 'example'
   allowed_domains = ['www.cartrade.com/buy-used-cars']
   start_urls = ['https://www.cartrade.com/buy-used-cars/']
   custom_settings = { 'FEED_URI' : 'tmp/data.csv'}

   def parse(self, response):
       # Extract product information
       titles = response.xpath('//div[@class = "carimgblk"]/a/img/@title').extract()
       URLs = response.xpath('//div[@class = "carimgblk"]/a/meta/@content').extract()
       prices = response.xpath('//div[@class = "cr_prc"]/text()').extract()

       for item in zip(titles,prices,URLs):
           scraped_info = {
               'title' : item[0].strip(),
               'price' : item[1].strip().replace(',', ''),
               'URL': item[2].strip(),
           }    
           yield scraped_info

       next_page = response.css('li.next a::attr(href)').extract_first()
       if next_page:
          yield scrapy.Request(response.urljoin(next_page),callback=self.parse)

The issue is that it's not scraping all the pages. I also noticed that the prices are not always correct. What am I doing wrong?

Upvotes: 0

Views: 322

Answers (1)

Granitosaurus

Reputation: 21406

Regarding parsing accuracy: the usual way to parse products out of HTML is to find the product blocks, iterate through them, and parse every block individually.

In your example you can see that every car listing has its own <div> block with class carlistblk:

cars = response.css('.carlistblk')
for car in cars:
    item = {}
    # query relative to the current listing block and extract the value itself
    item['title'] = car.xpath('.//img/@title').extract_first()
    ...
    yield item

Your zip method is easily disrupted: if a single listing is missing one field, you end up with, say, 10 titles but only 9 prices, and from that point on the fields are paired with the wrong listings.
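For illustration, a minimal standalone sketch of that misalignment (the car names and prices here are made up, not taken from the site):

# Suppose the second listing on the page has no price block:
titles = ['Maruti Swift', 'Hyundai i20', 'Honda City']
prices = ['3,50,000', '5,10,000']  # the i20's price is missing

# zip pairs purely by position, so the Honda City's price gets
# attached to the i20, and the Honda City is dropped entirely:
print(list(zip(titles, prices)))
# [('Maruti Swift', '3,50,000'), ('Hyundai i20', '5,10,000')]

Parsing each block individually avoids this: a missing field just becomes None for that one item instead of shifting every later value onto the wrong listing.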

Upvotes: 1
