I used lxml to parse a web page as follows:
>>> doc = lxml.html.fromstring(htmldata)
>>> element = doc.cssselect(sometag)[0]
>>> text = element.text_content()
>>> print text
u'Waldenstr\xf6m'
Why does it print u'Waldenstr\xf6m' rather than "Waldenström" here?
After that, I tried to add this text to a MySQL table with the UTF-8 character set and utf8_general_ci collation. Users is a Django model:
>>> Users.objects.create(last_name=text)
'ascii' codec can't encode character u'\xf6' in position 9: ordinal not in range(128)
What am I doing wrong here? How can I get the correct data "Waldenström" and write it to the database?
Upvotes: 1
>>> print text
u'Waldenstr\xf6m'
There is a difference between displaying something in the shell (which uses the repr) and printing it (which just spits out the string):
>>> u'Waldenstr\xf6m'
u'Waldenstr\xf6m'
>>> print u'Waldenstr\xf6m'
Waldenström
So, I'm not sure your snippet above is really what happened. If it definitely is, then your XHTML must contain exactly that string:
<div class="something">u'Waldenstr\xf6m'</div>
(maybe it was incorrectly generated by Python using a string's repr() instead of its str()?)
If this is right and intentional, you would need to parse that Python string literal into a simple string. One way of doing that would be:
>>> r= r"u'Waldenstr\xf6m'"
>>> print r[2:-1].decode('unicode-escape')
Waldenström
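On Python 3, where str has no decode method, ast.literal_eval is a rough equivalent: it evaluates the text as a Python literal without running arbitrary code. This is only a minimal sketch, assuming the scraped text is exactly a string literal like the one above; the variable names are illustrative.

```python
import ast

# Hypothetical scraped text: the repr of a unicode string, as in the example above.
raw = r"u'Waldenstr\xf6m'"

# literal_eval parses the text as a Python literal and returns the string it denotes.
# It raises ValueError or SyntaxError if the text is not a valid literal.
parsed = ast.literal_eval(raw)
print(parsed)
```

Unlike slicing off the u' and ' by hand, this also copes with double-quoted literals, at the cost of failing loudly when the text is not a literal at all.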
If the snippet at the top is actually not quite right and you are simply asking why Python's repr escapes all non-ASCII characters, the answer is that printing non-ASCII to the console is unreliable across various environments, so the escape is safer. In the above examples you might have received ?s or worse instead of the ö if you were unlucky.
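To see where the question marks come from, here is a small Python 3 sketch that forces the string into an encoding that cannot represent ö, which is roughly what a badly configured console ends up doing. The choice of ASCII with the replace error handler is an assumption made for illustration.

```python
name = 'Waldenstr\xf6m'

# With errors='replace', characters the target encoding lacks become '?'.
lossy = name.encode('ascii', errors='replace')
print(lossy)

# With the default strict handler the same call raises UnicodeEncodeError instead.
try:
    name.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```

The strict failure is the same kind of error as the 'ascii' codec message quoted in the question.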
In Python 3 this changes, and repr leaves printable non-ASCII characters unescaped:
>>> 'Waldenstr\xf6m'
'Waldenström'
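If you still want the escaped form in Python 3, the built-in ascii() behaves much like the old repr. A short sketch comparing the two:

```python
name = 'Waldenstr\xf6m'

# repr() keeps printable non-ASCII characters as they are in Python 3.
shown = repr(name)

# ascii() escapes them, much like repr() did in Python 2 (minus the u prefix).
escaped = ascii(name)
print(escaped)
```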
Upvotes: 0