Reputation: 11
I am trying to write a Python scraper using BeautifulSoup. I have successfully extracted most of the data, but I am now facing an encoding problem when extracting the price.
Here is my example:
The actual text is 1599€99
The scraped text is:
>>>prdt.find("span",{"class":"price"}).text
u'1599\u20ac99'
"\u20ac" is supposed to be the '€' symbol, but when I encode the text as UTF-8 I get:
>>>prdt.find("span",{"class":"price"}).text.encode(encoding='UTF-8')
'1599\xe2\x82\xac99'
Does anyone know how to fix this issue?
Thanks.
Upvotes: 0
Views: 135
Reputation: 1116
What you see is the repr of a unicode string: the interactive prompt echoes non-ASCII characters as escape sequences. You can see the actual content by simply printing it.
>>> u1 = u'1599\u20ac99'
>>> print u1
# prints 1599€99
>>> u2 = u'1599€99'
>>> u2
# prints u'1599\u20ac99'
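The same distinction can be checked in a script. The following is a minimal sketch, assuming Python 3 (where `str` plays the role of Python 2's `unicode`); the variable names are illustrative only, and `ascii()` is used here to imitate the escaped form that the Python 2 prompt echoes.

```python
# Python 3 sketch: str here corresponds to Python 2's unicode type.
price = "1599\u20ac99"

# The escaped spelling and the literal spelling denote the same string.
assert price == "1599€99"

# ascii() returns an escaped representation of the string,
# comparable to what the Python 2 prompt echoes (minus the u prefix).
escaped = ascii(price)

print(escaped)  # escaped form: '1599\u20ac99'
print(price)    # actual content: 1599€99
```

Nothing in the string changes between the two calls; only the way it is displayed differs.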
Upvotes: 1
Reputation: 22992
Your script works correctly:
>>> prdt.find("span",{"class":"price"}).text
u'1599\u20ac99'
The return value is a valid unicode string. The character u"\u20ac" is the EURO SIGN.
If you encode this character using the 'utf8' encoding, you get a byte string (Python 3 displays it with a b prefix, as below; Python 2 shows '\xe2\x82\xac' without it).
>>> u'\u20ac'.encode('utf8')
b'\xe2\x82\xac'
This is the same code point encoded in UTF-8: E2 82 AC.
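To confirm that nothing is lost, the encoding can be reversed. This is a small sketch, assuming Python 3 and using illustrative variable names: it encodes the scraped text to UTF-8, decodes it back, and looks up the character's Unicode name.

```python
import unicodedata

price = "1599\u20ac99"

# Encoding turns the single code point U+20AC into three bytes: E2 82 AC.
encoded = price.encode("utf-8")
assert encoded == b"1599\xe2\x82\xac99"

# Decoding with the same codec restores the original text exactly.
decoded = encoded.decode("utf-8")
assert decoded == price

# The text has 7 characters, while its UTF-8 form has 9 bytes.
char_count = len(price)
byte_count = len(encoded)

# The Unicode database confirms which character U+20AC is.
name = unicodedata.name("\u20ac")
print(name)  # EURO SIGN
```

So the encoded bytes in the question are the expected UTF-8 form of the price, not a corrupted value.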
See also this answer to "What is Unicode, UTF-8, UTF-16?".
Upvotes: 0