Bamieschijf

Reputation: 51

Scrapy: How to scrape a product price which consists of two parts?

I have a scraper that collects data from the website "www.bol.com". The only problem I have run into so far is that the price consists of two parts in the HTML source: the whole number and the fraction are separated.

I tried scraping the whole number and the fraction of the price separately. This works, but not every product has a price with a fraction. As a result, my scraper only collected the products whose prices consist of a number and a fraction, and skipped the ones without a fraction. Maybe I need to return "0" when no fraction is available?

HTML Source for price with fraction

The price is being displayed as: 777,86

<span class="promo-price" data-test="price">777
  <sup class="promo-price__fraction" data-test="price-fraction">86</sup>
</span>

Source: https://www.bol.com/nl/p/apple-iphone-11-64gb-zwart/9200000119815601/?bltgh=j0qmNqgyvLwjCGWgGxxPmA.1_30.31.ProductTitle

HTML Source for price without fraction

The price is being displayed as 739- (I don't need the - symbol)

<span class="promo-price" data-test="price">739
  <sup class="promo-price__fraction  promo-price__fraction--dash" data-test="price-fraction">-</sup>
</span>

Source: https://www.bol.com/nl/p/apple-iphone-xr-64gb-zwart/9200000098453451/?bltgh=j0qmNqgyvLwjCGWgGxxPmA.1_30.33.ProductTitle

Scraper

import scrapy
from ..items import Mobile

class AmazonScraper(scrapy.Spider):
    name = "bol_scraper"

    # How many pages you want to scrape
    no_of_pages = 1

    # Headers to fix 503 service unavailable error
    # Spoof headers to force servers to think that request coming from browser ;)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2840.71 Safari/539.36'}

    def start_requests(self):
        # starting urls for scraping
        urls = ["https://www.bol.com/nl/l/apple-iphones/N/4010+4294862300/?ruleRedirect=1&sI=iphone&variants="]

        for url in urls:
            yield scrapy.Request(url = url, callback = self.parse, headers = self.headers)

    def parse(self, response):

        self.no_of_pages -= 1

        # print(response.text)

        mobiles = response.xpath("//a[@class='product-title px_list_page_product_click']").xpath("@href").getall()

        # print(len(mobiles))

        for mobile in mobiles:
            final_url = response.urljoin(mobile)
            yield scrapy.Request(url=final_url, callback = self.parse_mobile, headers = self.headers)
            # break

        if(self.no_of_pages > 0):
            next_page_url = response.xpath("//ul[@class='pagination']/li[@class='[ pagination__controls pagination__controls--next ]']/a").xpath("@href").get()
            final_url = response.urljoin(next_page_url)
            yield scrapy.Request(url = final_url, callback = self.parse, headers = self.headers)

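    def parse_mobile(self, response):
        title = response.xpath("//span[@class='h-boxedright--xs']//text()").get() or response.xpath("//h1[@class='page-heading']//text()").get()
        # Join all text chunks of the price element: the whole number, the fraction (or "-"), and trailing whitespace
        rawprice = '.'.join(response.css('.promo-price ::text').extract())
        # Turn the "-" placeholder into "00" and drop newlines and spaces
        cleanprice = rawprice.replace('-','00').replace('\n', '').replace(' ','')
        # Drop the trailing "." left over from the join
        price = cleanprice[:-1]

        print(title, price)

        # price already contains the fraction, so no separate price_fraction field is needed
        yield Mobile(title = title.strip(), price = price.strip())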

Result before cleaning the data

{"title": "Apple iPhone 11 - 64GB - Zwart", "price": "777\n  .86."},
{"title": "Apple iPhone Xs - 64GB - Goud", "price": ""},
{"title": "Apple iPhone 11 Pro Max - 256GB - Spacegrijs", "price": "1319\n  .-."},
{"title": "Apple iPhone 11 Pro Max - 512GB - Goud", "price": "1589\n  .-."},
{"title": "Apple iPhone 11 - 256GB - Zwart", "price": "899\n  .-."},
{"title": "iPhone Xs Max - 64GB - Space Grey", "price": "849\n  .-."},
{"title": "Apple iPhone Xs - 64GB - Zilver", "price": "752\n  .-."},
{"title": "Apple iPhone XR - 128GB -  Zwart", "price": "716\n  .45."},
{"title": "Apple iPhone 11 Pro - 64GB - Middernachtgroen", "price": "1199\n  .-."},
{"title": "Apple iPhone 8 - 64GB - Spacegrijs", "price": "535\n  .12."},
{"title": "Apple iPhone 11 - 128GB - Zwart", "price": "833\n  .-."},
{"title": "Apple iPhone XR - 64GB - Zwart", "price": "739\n  .-."},
{"title": "Apple iPhone Xs - 64GB - Spacegrijs", "price": "745\n  .58."},
{"title": "Apple iPhone 7 - 32GB - Spacegrijs", "price": "378\n  .95."}

Result after cleaning the data

{"title": "Apple iPhone 11 - 64GB - Zwart", "price": "777.86"},
{"title": "Apple iPhone 11 Pro Max - 256GB - Spacegrijs", "price": "1319.00"},
{"title": "Apple iPhone Xs - 64GB - Goud", "price": ""},
{"title": "Apple iPhone Xs - 64GB - Zilver", "price": "752.00"},
{"title": "Apple iPhone 11 Pro Max - 512GB - Goud", "price": "1589.00"},
{"title": "Apple iPhone 11 - 256GB - Zwart", "price": "899.00"},
{"title": "iPhone Xs Max - 64GB - Space Grey", "price": "849.00"},
{"title": "Apple iPhone XR - 128GB -  Zwart", "price": "716.45"},
{"title": "Apple iPhone 8 - 64GB - Spacegrijs", "price": "535.12"},
{"title": "Apple iPhone 11 Pro - 64GB - Middernachtgroen", "price": "1199.00"},
{"title": "Apple iPhone Xs - 64GB - Spacegrijs", "price": "745.58"},
{"title": "Apple iPhone 11 - 128GB - Zwart", "price": "833.00"},
{"title": "Apple iPhone XR - 64GB - Zwart", "price": "739.00"},
{"title": "Apple iPhone 7 - 32GB - Spacegrijs", "price": "378.95"}

Upvotes: 0

Views: 1610

Answers (2)

Michael Savchenko

Reputation: 1445

I believe this is what you need:

price = '.'.join(response.css('.promo-price ::text').extract())

Your mistake was using the .get() function, which returns only the first text chunk it finds, while there are two to extract.
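As a rough sketch of what could be done with the extracted chunks afterwards: the helper below (the name combine_price is made up for illustration) takes the list that `response.css('.promo-price ::text').getall()` would return and builds a "whole.fraction" string. It assumes the chunks look like the raw results in the question, i.e. the whole number first, then the fraction or "-", plus whitespace-only chunks.

```python
def combine_price(chunks):
    """Join the text chunks of a .promo-price element into 'whole.fraction'.

    `chunks` is assumed to be the list of text nodes, e.g. ['777\n  ', '86', '\n'].
    Returns None when no price text was found at all.
    """
    # Drop whitespace-only chunks and trim the rest
    parts = [chunk.strip() for chunk in chunks if chunk.strip()]
    if not parts:
        return None

    whole = parts[0]
    # Use "00" when the fraction is missing or is the "-" placeholder
    fraction = parts[1] if len(parts) > 1 else "00"
    if not fraction.isdigit():
        fraction = "00"
    return f"{whole}.{fraction}"


print(combine_price(['777\n  ', '86', '\n']))
print(combine_price(['739\n  ', '-', '\n']))
```

Returning None for an empty list keeps products without a price distinguishable from products with a round price.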

Upvotes: 3

Jack Fleeting

Reputation: 24930

Using your two HTML samples as a basis, take a look at this and try to adapt it to your own code:

pricing = """
<prices>
  <unit>
  <span class="promo-price" data-test="price">777
  <sup class="promo-price__fraction" data-test="price-fraction">86</sup>
</span>
  </unit>
  <unit>
    <span class="promo-price" data-test="price">739
  <sup class="promo-price__fraction  promo-price__fraction--dash" data-test="price-fraction">-</sup>
</span>
  </unit>
</prices>

"""

from scrapy.selector import Selector

sel = Selector(text=pricing)
# Iterate over each price <span> as a selector instead of re-parsing its HTML
for span in sel.xpath('//span[@class="promo-price"]'):
    # The span's own first text node holds the whole number, e.g. "777"
    whole = span.xpath('./text()').get().strip()
    # The <sup> child holds the fraction, or "-" when there is none
    fract = span.xpath('./sup/text()').get().strip()
    price = whole + '.' + fract.replace('-', '00')
    print(price)

Output:

777.86
739.00
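If the prices need to be compared or summed later, strings like the ones printed above could be converted to numbers. A minimal sketch using the standard library's decimal module follows. The helper name to_decimal is hypothetical, and treating an empty string as "no price" is an assumption based on the empty price in the question's results.

```python
from decimal import Decimal, InvalidOperation


def to_decimal(price_text):
    """Convert a 'whole.fraction' price string to Decimal.

    Returns None for an empty or malformed string instead of raising.
    """
    if not price_text:
        return None
    try:
        return Decimal(price_text)
    except InvalidOperation:
        # The text was not a valid number, e.g. a leftover "-."
        return None


print(to_decimal("777.86"))
print(to_decimal(""))
```

Decimal avoids the rounding surprises that float can introduce for currency amounts.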

Upvotes: 2
