Reputation: 53
I'm trying to select an item on page:
http://www.betterware.co.uk/catalog/product/view/id/4530/category/342/
using variations of XPath such as:
sel.xpath('//div[@class="price-box"]/span[@class="regular-price"]/span[@class="price"]/text()').extract()
the html source I'm looking at is:
<div class="price-box">
<span class="regular-price" id="product-price-4530">
<span class="price">£12.99</span>
</span>
</div>
Rather than getting the correct [u'£12.99'], I get a bunch of other numbers that don't even appear in the page source. Scrapy shell gives:
[u'\xa312.99',
u'\xa38.99',
u'\xa38.99',
u'\xa34.49',
u'\xa34.49',
u'\xa329.99',
u'\xa329.99']
I've had no trouble selecting other items this way, but this and all my other price fields suffer the same mysterious results for the price text. Can someone please shed some light here? My Python code for the item selection is:
def parse_again(self, response):
    sel = Selector(response)
    meta = sel.xpath('//div[@class="product-main-info"]')
    items = []
    for m in meta:
        item = BetterItem()
        item['link'] = response.url
        item['item_name'] = m.select('//div[@class="product-name"]/h1/text()').extract()
        item['sku'] = m.select('//p[@class="product-ids"]/text()').extract()
        item['price'] = m.select('//div[@class="price-box"]/span/span/text()').extract()
        items.append(item)
    return items
Upvotes: 1
Views: 458
Reputation: 879729
There is nothing wrong with the result being returned by Scrapy. u'\xa3'
is the pound sign:
In [99]: import unicodedata as UD
In [100]: UD.name(u'\xa3')
Out[100]: 'POUND SIGN'
In [101]: print(u'\xa3')
£
u'\xa312.99' is the pound sign u'\xa3' followed by the unicode string u'12.99'.
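In other words, the escaped form in the repr and the literal pound sign are the same string. A minimal check (my own illustrative snippet, not from Scrapy):

```python
# u'\xa312.99' and u'£12.99' are the *same* string; \xa3 is just how
# the interpreter escapes the pound sign when showing the repr.
price = u'\xa312.99'
assert price == u'£12.99'
print(price)  # £12.99
```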
If you wish to strip the pound signs from the list, you could do this:
In [108]: data = [u'\xa312.99',
u'\xa38.99',
u'\xa38.99',
u'\xa34.49',
u'\xa34.49',
u'\xa329.99',
u'\xa329.99']
In [110]: [float(item.lstrip(u'\xa3')) for item in data]
Out[110]: [12.99, 8.99, 8.99, 4.49, 4.49, 29.99, 29.99]
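If you want something a bit more defensive than lstrip (for example, if some prices might carry whitespace or a different currency symbol), a small regex helper works too. parse_price below is a hypothetical helper name of my own, not part of Scrapy:

```python
import re

def parse_price(text):
    # Pull the first decimal number out of a price string,
    # ignoring any currency symbol or whitespace around it.
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None

data = [u'\xa312.99', u'\xa38.99', u'\xa34.49']
print([parse_price(p) for p in data])  # [12.99, 8.99, 4.49]
```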
A general introduction to Unicode, particularly one written from a Python-centric point of view, is a must-read for anyone dealing with these issues.
Upvotes: 1