user7730188

Trouble with Scrapy Regex on one site likely not using normal encoding

So for the 20+ sites I have this spider crawling through, all of the price items yield just fine. However, on this one specific site (https://www.garrafeiranacional.com/) there is a very annoying problem.

When I go to extract the price information from specific products, what is naturally returned without any MapCompose/Regex cleansing is something like this:

Before you ask, I have tried just about every combination I can think of. Normally I would do something like:

productLoader.add_xpath('blah', 'blah',
    MapCompose(lambda i: i.lstrip(punctuation).strip().replace('"', '').replace('.', ',')),
    re='[^\\d]+')  # re is an add_xpath() keyword, so it goes outside MapCompose(...)

I also tried regexes like:

I even tried using re.sub() within the MapCompose, like:

So I assume the issue is the encoding this website uses: another bit of info I was scraping from the site, which had a space in it, yielded the same weird \xa0 string. I tried stripping whitespace from the price, but nothing seems to do the trick. If anyone has any ideas on where I should look, I would love to hear them.

Upvotes: 0

Views: 76

Answers (1)

paul trmbrth

Reputation: 20748

\xa0 is simply a non-breaking space.
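To make that concrete, here is a small standard-library sketch (not Scrapy-specific) showing what the character is and why replacing an ordinary space does not remove it. The sample string is of the kind this site returns; the variable names are only illustrative.

```python
import unicodedata

raw = '7\xa0490,00\xa0€'  # sample price text containing two non-breaking spaces

# '\xa0' and '\u00A0' are two spellings of the same code point, U+00A0.
same_char = ('\xa0' == '\u00A0')

# Ask Python for the official Unicode name of the character.
nbsp_name = unicodedata.name('\xa0')

# U+00A0 is a different character from the ASCII space (U+0020),
# so replacing ' ' leaves the string unchanged.
unchanged = raw.replace(' ', '')

# Replacing the non-breaking space itself does the job.
cleaned = raw.replace('\xa0', '')

print(nbsp_name)
print(repr(unchanged))
print(repr(cleaned))
```

So this is not really an encoding problem: the page simply uses a different whitespace character from the one being stripped.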

On this page for example, here's some HTML containing the price value:

<div class="price-box price-final_price" data-role="priceBox" data-product-id="19815">
    <span class="price-container price-final_price tax weee"
          itemprop="offers" itemscope itemtype="http://schema.org/Offer">
        <span id="product-price-19815"
              data-price-amount="7490"
              data-price-type="finalPrice"
              class="price-wrapper ">
            <span class="price">7 490,00 €</span>
        </span>
        <meta itemprop="price" content="7490" />
        <meta itemprop="priceCurrency" content="EUR" />
    </span>
</div>

If you choose to use <span class="price">7 490,00 €</span> to get the price, you can simply replace '\xa0' with ' ' or the empty string:

$ scrapy shell https://www.garrafeiranacional.com/catalog/product/view/id/19815/s/1945-petrus-tinto/category/361/
2017-07-21 10:20:42 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
(...)
2017-07-21 10:20:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.garrafeiranacional.com/catalog/product/view/id/19815/s/1945-petrus-tinto/category/361/> (referer: None)

>>> response.css('span.price').get()
'<span class="price">7\xa0490,00\xa0€</span>'
>>> response.css('span.price::text').get()
'7\xa0490,00\xa0€'

>>> response.css('span.price::text').get().replace('\u00A0', '')
'7490,00€'
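If you want a number rather than a cleaned string, something along these lines may work. It is only a sketch: `parse_price` is a hypothetical helper name, and it assumes the format seen on this site (spaces or non-breaking spaces as thousands separators, a comma as the decimal mark).

```python
import re
from decimal import Decimal


def parse_price(text):
    """Turn a price string in this site's format into a Decimal.

    Assumption: a comma is the decimal mark and everything else that is
    not a digit (non-breaking spaces, spaces, the currency sign) is noise.
    """
    # Keep only digits and the decimal comma.
    digits = re.sub(r'[^\d,]', '', text)
    # Decimal expects a dot as the decimal mark.
    return Decimal(digits.replace(',', '.'))


price = parse_price('7\xa0490,00\xa0€')
print(price)
```

In an item loader, a function like this could be passed to MapCompose in place of the lambda. Note that it raises an exception on input containing no digits, so you may want to guard against empty values.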

Another option, which is probably easier to consume in your program, is to use the other places in the page where the price appears. In that same HTML snippet above, you can see:

    <meta itemprop="price" content="7490" />
    <meta itemprop="priceCurrency" content="EUR" />

It is also in the <head> part:

<meta property="product:price:amount" content="7490"/>
<meta property="product:price:currency" content="EUR"/>

Upvotes: 1
