Reputation: 1285
I am using scrapy to scrape this website: https://www.cartrade.com/buy-used-cars/p-10
My code:
    import scrapy

    class ShopcluesSpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['www.cartrade.com/buy-used-cars']
        start_urls = ['https://www.cartrade.com/buy-used-cars/']
        custom_settings = {'FEED_URI': 'tmp/data.csv'}

        def parse(self, response):
            # Extract product information
            titles = response.xpath('//div[@class = "carimgblk"]/a/img/@title').extract()
            URLs = response.xpath('//div[@class = "carimgblk"]/a/meta/@content').extract()
            prices = response.xpath('//div[@class = "cr_prc"]/text()').extract()
            for item in zip(titles, prices, URLs):
                scraped_info = {
                    'title': item[0].strip(),
                    'price': item[1].strip().replace(',', ''),
                    'URL': item[2].strip(),
                }
                yield scraped_info
            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
The issue is that it's not scraping all pages. I also noticed prices are not totally correct. What am I doing wrong?
Upvotes: 0
Views: 322
Reputation: 21406
Regarding parsing accuracy: the general way to parse products out of HTML is to find the product blocks, iterate through them, and parse each block individually.
In your example, every car listing has its own <div> block with class carlistblk:
    cars = response.css('.carlistblk')
    for car in cars:
        item = {}
        # Relative XPath (leading dot) searches only inside this listing's block
        item['title'] = car.xpath('.//img/@title').extract_first()
        ...
        yield item
Your zip approach is easily disrupted if even one listing is missing a field: you then end up with, say, 10 titles and 9 prices, and every price after the gap gets paired with the wrong title.
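To make the misalignment concrete, here is a minimal sketch in plain Python, without Scrapy. The `listings` data is made up for illustration and is not taken from the site; each dict stands in for one listing block, and the second one has no price.

```python
# Hypothetical data: each dict stands in for one listing block on the page.
listings = [
    {"title": "Car A", "price": "100"},
    {"title": "Car B"},                  # this listing has no price
    {"title": "Car C", "price": "300"},
]

# Page-wide extraction: each field is collected into its own flat list,
# so the missing price simply leaves the prices list one element short.
titles = [entry["title"] for entry in listings if "title" in entry]
prices = [entry["price"] for entry in listings if "price" in entry]

# zip pairs by position and stops at the shortest list:
# "Car B" receives the price of "Car C", and "Car C" is dropped.
zipped = list(zip(titles, prices))

# Per-block extraction: every field is read from the same block,
# so a missing value stays attached to the right listing as None.
per_block = [
    {"title": entry.get("title"), "price": entry.get("price")}
    for entry in listings
]

print(zipped)
print(per_block)
```

With per-block parsing, a missing field shows up as an empty value on that one item instead of shifting every later row.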
Upvotes: 1