Reputation: 51
I have a scraper that collects data from the website "www.bol.com". The only problem I have stumbled upon so far is that the price consists of two parts in the HTML source: the whole number and the fraction are separate elements.
I tried scraping the whole number and the fraction separately. This works, but not every product has a price with a fraction, so my scraper ended up collecting only the products whose price has both a number and a fraction, and skipped the ones without a fraction. Maybe I need to return "0" when no fraction is available?
HTML Source for price with fraction
The price is being displayed as: 777,86
<span class="promo-price" data-test="price">777
<sup class="promo-price__fraction" data-test="price-fraction">86</sup>
</span>
HTML Source for price without fraction
The price is being displayed as 739- (I don't need the - symbol)
<span class="promo-price" data-test="price">739
<sup class="promo-price__fraction promo-price__fraction--dash" data-test="price-fraction">-</sup>
</span>
Scraper
import scrapy
from ..items import Mobile

class AmazonScraper(scrapy.Spider):
    name = "bol_scraper"
    # How many pages you want to scrape
    no_of_pages = 1
    # Headers to fix 503 service unavailable error
    # Spoof headers so the server thinks the request comes from a browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2840.71 Safari/539.36'}

    def start_requests(self):
        # Starting URLs for scraping
        urls = ["https://www.bol.com/nl/l/apple-iphones/N/4010+4294862300/?ruleRedirect=1&sI=iphone&variants="]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=self.headers)

    def parse(self, response):
        self.no_of_pages -= 1
        # print(response.text)
        mobiles = response.xpath("//a[@class='product-title px_list_page_product_click']").xpath("@href").getall()
        # print(len(mobiles))
        for mobile in mobiles:
            final_url = response.urljoin(mobile)
            yield scrapy.Request(url=final_url, callback=self.parse_mobile, headers=self.headers)
            # break
        if self.no_of_pages > 0:
            next_page_url = response.xpath("//ul[@class='pagination']/li[@class='[ pagination__controls pagination__controls--next ]']/a").xpath("@href").get()
            final_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=final_url, callback=self.parse, headers=self.headers)

    def parse_mobile(self, response):
        title = response.xpath("//span[@class='h-boxedright--xs']//text()").get() or response.xpath("//h1[@class='page-heading']//text()").get()
        # Join every text chunk inside .promo-price (whole number, fraction, trailing whitespace) with '.'
        rawprice = '.'.join(response.css('.promo-price ::text').extract())
        # A dash means "no fraction", so turn it into '00'; also drop newlines and spaces
        cleanprice = rawprice.replace('-', '00').replace('\n', '').replace(' ', '')
        # Drop the trailing '.' left over from the join
        price = cleanprice[:-1]
        # price_fraction is no longer scraped separately, so it is not referenced here
        print(title, price)
        yield Mobile(title=title.strip(), price=price.strip())
Result before cleaning the data
{"title": "Apple iPhone 11 - 64GB - Zwart", "price": "777\n .86."},
{"title": "Apple iPhone Xs - 64GB - Goud", "price": ""},
{"title": "Apple iPhone 11 Pro Max - 256GB - Spacegrijs", "price": "1319\n .-."},
{"title": "Apple iPhone 11 Pro Max - 512GB - Goud", "price": "1589\n .-."},
{"title": "Apple iPhone 11 - 256GB - Zwart", "price": "899\n .-."},
{"title": "iPhone Xs Max - 64GB - Space Grey", "price": "849\n .-."},
{"title": "Apple iPhone Xs - 64GB - Zilver", "price": "752\n .-."},
{"title": "Apple iPhone XR - 128GB - Zwart", "price": "716\n .45."},
{"title": "Apple iPhone 11 Pro - 64GB - Middernachtgroen", "price": "1199\n .-."},
{"title": "Apple iPhone 8 - 64GB - Spacegrijs", "price": "535\n .12."},
{"title": "Apple iPhone 11 - 128GB - Zwart", "price": "833\n .-."},
{"title": "Apple iPhone XR - 64GB - Zwart", "price": "739\n .-."},
{"title": "Apple iPhone Xs - 64GB - Spacegrijs", "price": "745\n .58."},
{"title": "Apple iPhone 7 - 32GB - Spacegrijs", "price": "378\n .95."}
Result after cleaning the data
{"title": "Apple iPhone 11 - 64GB - Zwart", "price": "777.86"},
{"title": "Apple iPhone 11 Pro Max - 256GB - Spacegrijs", "price": "1319.00"},
{"title": "Apple iPhone Xs - 64GB - Goud", "price": ""},
{"title": "Apple iPhone Xs - 64GB - Zilver", "price": "752.00"},
{"title": "Apple iPhone 11 Pro Max - 512GB - Goud", "price": "1589.00"},
{"title": "Apple iPhone 11 - 256GB - Zwart", "price": "899.00"},
{"title": "iPhone Xs Max - 64GB - Space Grey", "price": "849.00"},
{"title": "Apple iPhone XR - 128GB - Zwart", "price": "716.45"},
{"title": "Apple iPhone 8 - 64GB - Spacegrijs", "price": "535.12"},
{"title": "Apple iPhone 11 Pro - 64GB - Middernachtgroen", "price": "1199.00"},
{"title": "Apple iPhone Xs - 64GB - Spacegrijs", "price": "745.58"},
{"title": "Apple iPhone 11 - 128GB - Zwart", "price": "833.00"},
{"title": "Apple iPhone XR - 64GB - Zwart", "price": "739.00"},
{"title": "Apple iPhone 7 - 32GB - Spacegrijs", "price": "378.95"}
Upvotes: 0
Views: 1610
Reputation: 1445
I believe this is what you need:
price = '.'.join(response.css('.promo-price ::text').extract())
Your mistake is using .get(), which picks only the first text chunk it finds, while there are two to extract.
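As a rough sketch of how the extracted chunks could then be normalised without chained replace() calls: the helper below takes the list that .extract() returns and builds the final string. The name build_price is hypothetical, and the chunk layout (whole number first, fraction or dash second, whitespace-only leftovers) is an assumption based on the raw results shown in the question.

```python
def build_price(chunks):
    """Combine text chunks such as ['777\\n ', '86', '\\n'] into '777.86'.

    `chunks` is assumed to be the list returned by
    response.css('.promo-price ::text').extract().
    """
    # Drop surrounding whitespace and ignore whitespace-only chunks
    parts = [c.strip() for c in chunks if c.strip()]
    if not parts:
        # No price element on the page at all
        return None
    whole = parts[0]
    fraction = parts[1] if len(parts) > 1 else ""
    # A dash (or a missing fraction) means the price is a whole amount
    if not fraction.isdigit():
        fraction = "00"
    return f"{whole}.{fraction}"


print(build_price(["777\n ", "86", "\n"]))  # 777.86
print(build_price(["739\n ", "-", "\n"]))   # 739.00
print(build_price([]))                      # None
```

Returning None for an empty list keeps products without any price (such as the entry with an empty "price" in the question) distinguishable from products priced at a whole amount.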
Upvotes: 3
Reputation: 24930
Using your two HTML samples as a basis, take a look at this and try to adapt it to your own code:
pricing = """
<prices>
<unit>
<span class="promo-price" data-test="price">777
<sup class="promo-price__fraction" data-test="price-fraction">86</sup>
</span>
</unit>
<unit>
<span class="promo-price" data-test="price">739
<sup class="promo-price__fraction promo-price__fraction--dash" data-test="price-fraction">-</sup>
</span>
</unit>
</prices>
"""
from scrapy.selector import Selector

sel = Selector(text=pricing)
prices = sel.xpath('//span[@class="promo-price"]').extract()
for price in prices:
    sel = Selector(text=price)
    fract = sel.xpath('.//text()').extract()
    full = sel.xpath('//span/text()').extract()
    price = full[0].strip() + '.' + fract[1].strip().replace('-', '00')
    print(price)
Output:
777.86
739.00
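A possible variation is to look the fraction up by its data-test attribute instead of by position in the list of text nodes, so a missing fraction element does not raise an IndexError. The sketch below uses the standard library's xml.etree.ElementTree rather than Scrapy's Selector, purely so that it runs without Scrapy installed; the same attribute test can be written as an XPath predicate in Scrapy. The function name extract_prices and the third sample span (a price with no sup element at all) are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# The two samples from the question, plus a hypothetical span without a <sup>
SAMPLE = """
<prices>
  <span class="promo-price" data-test="price">777
    <sup class="promo-price__fraction" data-test="price-fraction">86</sup>
  </span>
  <span class="promo-price" data-test="price">739
    <sup class="promo-price__fraction promo-price__fraction--dash" data-test="price-fraction">-</sup>
  </span>
  <span class="promo-price" data-test="price">1199</span>
</prices>
"""


def extract_prices(markup):
    """Return a 'whole.fraction' string for every price span in the markup."""
    root = ET.fromstring(markup.strip())
    results = []
    for span in root.iter("span"):
        if span.get("data-test") != "price":
            continue
        # Text directly inside the span, before the <sup> child
        whole = (span.text or "").strip()
        # Find the fraction by attribute rather than by list index
        sup = span.find("sup[@data-test='price-fraction']")
        fraction = (sup.text or "").strip() if sup is not None else ""
        # A dash or a missing fraction is treated as zero cents
        if not fraction.isdigit():
            fraction = "00"
        results.append(f"{whole}.{fraction}")
    return results


print(extract_prices(SAMPLE))  # ['777.86', '739.00', '1199.00']
```

ElementTree only accepts well-formed XML, so this works on the cleaned-up samples but is not a drop-in replacement for parsing a full product page.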
Upvotes: 2