Scrapy: How to scrape a product price which consists of two parts?

Question

I have a scraper that collects data from the website "www.bol.com". The only problem I have run into so far is that the price consists of two parts in the HTML source: the whole number and the fraction are separated.

I tried scraping the whole number and the fraction separately. This works, but not every product has a price with a fraction, so my scraper ended up collecting only the products whose price has both a number and a fraction, and skipped the ones without a fraction. Maybe I need to return "0" when no fraction is available?

HTML Source for price with fraction

The price is being displayed as: 777,86

777
  ⁸⁶

Source: https://www.bol.com/nl/p/apple-iphone-11-64gb-zwart/9200000119815601/?bltgh=j0qmNqgyvLwjCGWgGxxPmA.1_30.31.ProductTitle

HTML Source for price without fraction

The price is being displayed as 739- (I don't need the - symbol)

739
  ^-

Source: https://www.bol.com/nl/p/apple-iphone-xr-64gb-zwart/9200000098453451/?bltgh=j0qmNqgyvLwjCGWgGxxPmA.1_30.33.ProductTitle

Scraper

import scrapy
from ..items import Mobile

class AmazonScraper(scrapy.Spider):
    name = "bol_scraper"

    # How many pages you want to scrape
    no_of_pages = 1

    # Headers to fix 503 service unavailable error
    # Spoof headers to force servers to think that request coming from browser ;)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2840.71 Safari/539.36'}

    def start_requests(self):
        # starting urls for scraping
        urls = ["https://www.bol.com/nl/l/apple-iphones/N/4010+4294862300/?ruleRedirect=1&sI=iphone&variants="]

        for url in urls: yield scrapy.Request(url = url, callback = self.parse, headers = self.headers)

    def parse(self, response):

        self.no_of_pages -= 1

        # print(response.text)

        mobiles = response.xpath("//a[@class='product-title px_list_page_product_click']").xpath("@href").getall()

        # print(len(mobiles))

        for mobile in mobiles:
            final_url = response.urljoin(mobile)
            yield scrapy.Request(url=final_url, callback = self.parse_mobile, headers = self.headers)
            # break

        if(self.no_of_pages > 0):
            next_page_url = response.xpath("//ul[@class='pagination']/li[@class='[ pagination__controls pagination__controls--next ]']/a").xpath("@href").get()
            final_url = response.urljoin(next_page_url)
            yield scrapy.Request(url = final_url, callback = self.parse, headers = self.headers)

    def parse_mobile(self, response):
        title = response.xpath("//span[@class='h-boxedright--xs']//text()").get() or response.xpath("//h1[@class='page-heading']//text()").get()
        # Join every text chunk inside .promo-price, e.g. "777\n  .86."
        rawprice = '.'.join(response.css('.promo-price ::text').extract())
        # Turn the "-" placeholder into "00" and strip newlines and spaces
        cleanprice = rawprice.replace('-', '00').replace('\n', '').replace(' ', '')
        # Drop the trailing "."
        price = cleanprice[:-1]

        print(title, price)

        yield Mobile(title=title.strip(), price=price.strip())

Result before cleaning the data

{"title": "Apple iPhone 11 - 64GB - Zwart", "price": "777\n  .86."},
{"title": "Apple iPhone Xs - 64GB - Goud", "price": ""},
{"title": "Apple iPhone 11 Pro Max - 256GB - Spacegrijs", "price": "1319\n  .-."},
{"title": "Apple iPhone 11 Pro Max - 512GB - Goud", "price": "1589\n  .-."},
{"title": "Apple iPhone 11 - 256GB - Zwart", "price": "899\n  .-."},
{"title": "iPhone Xs Max - 64GB - Space Grey", "price": "849\n  .-."},
{"title": "Apple iPhone Xs - 64GB - Zilver", "price": "752\n  .-."},
{"title": "Apple iPhone XR - 128GB -  Zwart", "price": "716\n  .45."},
{"title": "Apple iPhone 11 Pro - 64GB - Middernachtgroen", "price": "1199\n  .-."},
{"title": "Apple iPhone 8 - 64GB - Spacegrijs", "price": "535\n  .12."},
{"title": "Apple iPhone 11 - 128GB - Zwart", "price": "833\n  .-."},
{"title": "Apple iPhone XR - 64GB - Zwart", "price": "739\n  .-."},
{"title": "Apple iPhone Xs - 64GB - Spacegrijs", "price": "745\n  .58."},
{"title": "Apple iPhone 7 - 32GB - Spacegrijs", "price": "378\n  .95."}

Result after cleaning the data

{"title": "Apple iPhone 11 - 64GB - Zwart", "price": "777.86"},
{"title": "Apple iPhone 11 Pro Max - 256GB - Spacegrijs", "price": "1319.00"},
{"title": "Apple iPhone Xs - 64GB - Goud", "price": ""},
{"title": "Apple iPhone Xs - 64GB - Zilver", "price": "752.00"},
{"title": "Apple iPhone 11 Pro Max - 512GB - Goud", "price": "1589.00"},
{"title": "Apple iPhone 11 - 256GB - Zwart", "price": "899.00"},
{"title": "iPhone Xs Max - 64GB - Space Grey", "price": "849.00"},
{"title": "Apple iPhone XR - 128GB -  Zwart", "price": "716.45"},
{"title": "Apple iPhone 8 - 64GB - Spacegrijs", "price": "535.12"},
{"title": "Apple iPhone 11 Pro - 64GB - Middernachtgroen", "price": "1199.00"},
{"title": "Apple iPhone Xs - 64GB - Spacegrijs", "price": "745.58"},
{"title": "Apple iPhone 11 - 128GB - Zwart", "price": "833.00"},
{"title": "Apple iPhone XR - 64GB - Zwart", "price": "739.00"},
{"title": "Apple iPhone 7 - 32GB - Spacegrijs", "price": "378.95"}

Michael Savchenko · Accepted Answer

I believe this should work:

price = '.'.join(response.css('.promo-price ::text').extract())

Your mistake is using the .get() function, which picks only the first text chunk it finds, while there are two to extract.
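If you would rather not rely on chained .replace() calls, here is a minimal sketch of one way to clean the extracted chunks. The helper name normalize_price and the two-decimal output format are my own assumptions, not part of Scrapy. It expects the list returned by .extract() and assumes the first chunk with digits is the whole number and the second, if present, is the fraction.

```python
import re


def normalize_price(chunks):
    """Combine the text chunks of a split price into a string like '777.86'.

    `chunks` is assumed to be the list returned by
    response.css('.promo-price ::text').extract().
    This helper is hypothetical and only illustrates the idea.
    """
    # Strip everything that is not a digit from each chunk. Whitespace-only
    # chunks and the "-" placeholder become empty strings.
    parts = [re.sub(r"\D", "", chunk) for chunk in chunks]
    # Keep only the chunks that actually contained digits.
    parts = [part for part in parts if part]
    if not parts:
        return ""  # no price found on the page
    whole = parts[0]
    # Fall back to "00" when the page shows no fraction.
    fraction = parts[1] if len(parts) > 1 else "00"
    return f"{whole}.{fraction}"


print(normalize_price(["777\n  ", "86", "\n"]))
print(normalize_price(["739\n  ", "-", "\n"]))
```

In parse_mobile you could then write price = normalize_price(response.css('.promo-price ::text').extract()). Products without any price element still yield an empty string, the same as in your current output.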
