Chrisinpants

Reputation: 53

Strange XPath results in Scrapy shell

I'm trying to select an item on this page:

http://www.betterware.co.uk/catalog/product/view/id/4530/category/342/

using variations of XPath such as:

sel.xpath('//div[@class="price-box"]/span[@class="regular-price"]/span[@class="price"]/text()').extract()

The HTML source I'm looking at is:

<div class="price-box">
    <span class="regular-price" id="product-price-4530">
        <span class="price">£12.99</span>
    </span>
</div>

Rather than getting the correct [u'£12.99'], I get a bunch of other numbers that don't even appear in the page source. Scrapy shell gives:

[u'\xa312.99',
 u'\xa38.99',
 u'\xa38.99',
 u'\xa34.49',
 u'\xa34.49',
 u'\xa329.99',
 u'\xa329.99']

I've had no trouble selecting other items in this manner, but this and all my other price fields are suffering these mysterious results for the price text. Can someone please shed some light for me here? My python code for the items selection is:

def parse_again(self, response):
    sel = Selector(response)
    meta = sel.xpath('//div[@class="product-main-info"]')
    items = []
    for m in meta:
        item = BetterItem()
        item['link'] = response.url
        item['item_name'] = m.select('//div[@class="product-name"]/h1/text()').extract()
        item['sku'] = m.select('//p[@class="product-ids"]/text()').extract()
        item['price'] = m.select('//div[@class="price-box"]/span/span/text()').extract()
        items.append(item)
    return items

Upvotes: 1

Views: 458

Answers (1)

unutbu

Reputation: 879729

There is nothing wrong with the result being returned by Scrapy. u'\xa3' is the pound sign:

In [99]: import unicodedata as UD

In [100]: UD.name(u'\xa3')
Out[100]: 'POUND SIGN'

In [101]: print(u'\xa3')
£

u'\xa312.99' is the pound sign u'\xa3' followed by the unicode string u'12.99'.
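To see why the escape ends where it does, here is a small sketch (the variable names are just for illustration). In a Python string literal, \x consumes exactly two hex digits, so u'\xa312.99' is read as \xa3 followed by 12.99, not as some longer code point:

```python
import unicodedata

price = u'\xa312.99'

# \x consumes exactly two hex digits, so the escape ends after "a3".
symbol, amount = price[0], price[1:]

print(len(price))                 # 6 characters in total
print(unicodedata.name(symbol))   # POUND SIGN
print(amount)                     # 12.99
```

The escaped form in the shell output is only how the list displays its elements; the string itself holds the pound sign and the digits from the page.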

If you wish to strip the pound signs from the list, you could do this:

In [108]: data = [u'\xa312.99',
 u'\xa38.99',
 u'\xa38.99',
 u'\xa34.49',
 u'\xa34.49',
 u'\xa329.99',
 u'\xa329.99']

In [110]: [float(item.lstrip(u'\xa3')) for item in data]
Out[110]: [12.99, 8.99, 8.99, 4.49, 4.49, 29.99, 29.99]
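If the prices are going to be summed or compared, float may introduce rounding noise, so one alternative is to parse into decimal.Decimal. This is only a sketch: parse_price is a hypothetical helper name, and it assumes every value is a pound sign followed by a plain decimal number with no thousands separators:

```python
from decimal import Decimal

def parse_price(text, symbol=u'\xa3'):
    """Hypothetical helper: drop surrounding whitespace and a leading
    currency symbol, then parse the remainder as an exact Decimal."""
    cleaned = text.strip().lstrip(symbol)
    return Decimal(cleaned)

data = [u'\xa312.99', u'\xa38.99', u'\xa329.99']
prices = [parse_price(item) for item in data]
print(prices)
```

Decimal raises decimal.InvalidOperation on text it cannot parse, which makes unexpected price formats show up early instead of passing through silently.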

The following articles are "must-reads" for anyone dealing with unicode:

and particularly for a Python-centric point of view:

Upvotes: 1
