Strange XPath results in Scrapy shell

Question

I'm trying to select an item on this page:

http://www.betterware.co.uk/catalog/product/view/id/4530/category/342/

using variations of XPath such as:

sel.xpath('//div[@class="price-box"]/span[@class="regular-price"]/span[@class="price"]/text()').extract()

The HTML source I'm looking at is:

<div class="price-box">
    <span class="regular-price">
        <span class="price">£12.99</span>
    </span>
</div>

Rather than getting the correct [u'£12.99'], I get a bunch of other numbers that don't even appear in the page source. Scrapy shell gives:

[u'\xa312.99',
 u'\xa38.99',
 u'\xa38.99',
 u'\xa34.49',
 u'\xa34.49',
 u'\xa329.99',
 u'\xa329.99']

I've had no trouble selecting other items in this manner, but this and all my other price fields show these mysterious results for the price text. Can someone please shed some light on this? My Python code for the item selection is:

def parse_again(self, response):
    sel = Selector(response)
    meta = sel.xpath('//div[@class="product-main-info"]')
    items = []
    for m in meta:
        item = BetterItem()
        item['link'] = response.url
        item['item_name'] = m.select('//div[@class="product-name"]/h1/text()').extract()
        item['sku'] = m.select('//p[@class="product-ids"]/text()').extract()
        item['price'] = m.select('//div[@class="price-box"]/span/span/text()').extract()
        items.append(item)
    return items

unutbu · Accepted Answer

There is nothing wrong with the result being returned by Scrapy. u'\xa3' is the pound sign:

In [99]: import unicodedata as UD

In [100]: UD.name(u'\xa3')
Out[100]: 'POUND SIGN'

In [101]: print(u'\xa3')
£

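u'\xa312.99' is the pound sign u'\xa3' followed by the unicode string u'12.99'. The shell shows the escaped form because a list is displayed using the repr of each element.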
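As for why seven prices come back instead of one: the likely cause (an assumption, since the full page source is not shown) is that the XPath starts with //, which searches the whole document and so also matches the price boxes of other products on the page. In Scrapy, a nested selector such as m.xpath('.//div[@class="price-box"]/span/span/text()'), with the leading dot, restricts the search to the element m. The sketch below illustrates the same scoping idea with the standard library's ElementTree rather than Scrapy selectors, so that it runs without extra packages; the markup in PAGE is hypothetical and much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the main product plus one related product.
PAGE = u"""
<html><body>
  <div class="product-main-info">
    <div class="price-box"><span class="regular-price"><span class="price">\u00a312.99</span></span></div>
  </div>
  <div class="related">
    <div class="price-box"><span class="regular-price"><span class="price">\u00a38.99</span></span></div>
  </div>
</body></html>
"""

# Path to the price text node's parent element, relative to wherever the search starts.
PRICE_PATH = ".//div[@class='price-box']/span/span"

root = ET.fromstring(PAGE.strip())

# Searching from the document root finds every price box on the page.
all_prices = [e.text for e in root.findall(PRICE_PATH)]
# all_prices is [u'\xa312.99', u'\xa38.99']

# Searching from the main product block finds only that product's price.
main = root.find(".//div[@class='product-main-info']")
main_prices = [e.text for e in main.findall(PRICE_PATH)]
# main_prices is [u'\xa312.99']
```

If the price box on the real page is not nested inside the product-main-info div, the scoped search would return nothing, so the container to search from has to be chosen to match the actual markup.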

If you wish to strip the pound signs from the list, you could do this:

In [108]: data = [u'\xa312.99',
 u'\xa38.99',
 u'\xa38.99',
 u'\xa34.49',
 u'\xa34.49',
 u'\xa329.99',
 u'\xa329.99']

In [110]: [float(item.lstrip(u'\xa3')) for item in data]
Out[110]: [12.99, 8.99, 8.99, 4.49, 4.49, 29.99, 29.99]
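If the prices are going to be summed or compared, decimal.Decimal may be a safer target type than float, since it keeps the values exact as written. A minimal sketch, assuming every price is a pound sign followed by a plain decimal number; parse_price is a hypothetical helper name, not part of Scrapy:

```python
from decimal import Decimal

POUND = u'\xa3'  # the pound sign, U+00A3


def parse_price(text):
    """Hypothetical helper: strip whitespace and a leading pound sign, return a Decimal."""
    return Decimal(text.strip().lstrip(POUND))


data = [u'\xa312.99', u'\xa38.99', u'\xa34.49', u'\xa329.99']

prices = [parse_price(item) for item in data]
# prices is [Decimal('12.99'), Decimal('8.99'), Decimal('4.49'), Decimal('29.99')]

total = sum(prices)
# total is Decimal('56.46')
```

Decimal raises decimal.InvalidOperation on text it cannot parse (for example a price containing a thousands separator), so real pages may need extra cleaning before this step.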

The following articles are "must-reads" for anyone dealing with unicode:

The Absolute Minimum Every Software Developer Must Know About Unicode

and particularly for a Python-centric point of view:

