user3351262

Reputation: 11

How to correct the encoding of the data scraped with beautifulsoup?

I am trying to write a Python scraper using BeautifulSoup. I have successfully extracted most of the data, but I am now facing an encoding problem when extracting the price.

Here is my example:

The actual text is 1599€99

The scraped text is:

>>> prdt.find("span",{"class":"price"}).text
u'1599\u20ac99'

"\u20ac" is supposed to be the '€' symbol, but when I encode the text as UTF-8 I get:

>>> prdt.find("span",{"class":"price"}).text.encode(encoding='UTF-8')
'1599\xe2\x82\xac99'

Does anyone know how to fix this issue?

Thanks.

Upvotes: 0

Views: 135

Answers (2)

sardok

Reputation: 1116

That is the representation (repr) of a unicode string, not its content. You can see the content by simply printing it.

>>> u1 = u'1599\u20ac99'
>>> print u1
# prints 1599€99
>>> u2 = u'1599€99'
>>> u2
# displays u'1599\u20ac99'
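The session above is Python 2. As a minimal sketch of the same point, assuming Python 3 (where every str is already unicode and the built-in ascii() gives the escaped form), the variable names below are only illustrative:

```python
# The escape sequence and the named character denote the same string.
price = "1599\u20ac99"
same = price == "1599\N{EURO SIGN}99"

# ascii() returns the escaped representation, similar to the Python 2 repr.
escaped = ascii(price)

print(same)        # True
print(escaped)     # '1599\u20ac99'
print(len(price))  # 7 characters: the euro sign counts as one
```

On a terminal that can display it, print(price) would show the euro sign itself; only the representation uses the \u20ac escape.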

Upvotes: 1

Laurent LAPORTE

Reputation: 22992

Your script works well:

>>> prdt.find("span",{"class":"price"}).text
u'1599\u20ac99'

The return value is a valid unicode string. The character u"\u20ac" is the EURO SIGN.

If you encode this character as UTF-8, you get a byte string:

>>> u'\u20ac'.encode('utf8')
b'\xe2\x82\xac'

These three bytes, E2 82 AC, are the UTF-8 encoding of that same code point.
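To check this round trip yourself, here is a small sketch assuming Python 3 and only the standard library. It encodes the character, inspects the bytes, and decodes them back:

```python
import unicodedata

euro = "\u20ac"

encoded = euro.encode("utf-8")     # three bytes: E2 82 AC
decoded = encoded.decode("utf-8")  # back to the one-character string

print(unicodedata.name(euro))   # EURO SIGN
print(encoded.hex())            # e282ac
print(len(euro), len(encoded))  # 1 3
print(decoded == euro)          # True
```

So the '\xe2\x82\xac' in the question is not corruption: it is the byte-level form of the same character, and decoding it as UTF-8 gives the original text back.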

See also this answer to "What is Unicode, UTF-8, UTF-16?".

Upvotes: 0
