Duoramy

Reputation: 25

Python: Unicode to html entities

I am having problems converting a Unicode string to HTML entities.

Here is my current code:

>>> name = u'\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'
>>> entities = name.encode('ascii', 'xmlcharrefreplace')
>>> print str(entities)
&#195;&#161;&#195;&#161;&#195;&#161;&#195;&#161;

Each \xc3\xa1 should be a single á (a multibyte character), but when I convert the string to entities, I get two entities per character.

Upvotes: 1

Views: 3480

Answers (1)

Lukas Graf

Reputation: 32560

\xc3\xa1 is the two-byte UTF-8 encoding of á, not its Unicode code point.

(áááá in Unicode would be u'\xe1\xe1\xe1\xe1')

You therefore need to use a byte string literal to define it, not a unicode literal ('' vs u''). Once you have the UTF-8 bytes, you need to decode them to Unicode in order to encode the result again to ASCII with XML character references:

>>> name = '\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'.decode('utf-8')
>>> name.encode('ascii', 'xmlcharrefreplace')
'&#225;&#225;&#225;&#225;'
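To see both behaviours side by side, here is a minimal sketch of the same decode-then-encode step. The helper name utf8_to_entities is hypothetical, and the b'' prefix is only there so the snippet also runs on Python 3, where the result is a bytes object rather than a str. Decoding the bytes as Latin-1 is used purely to reproduce the mistake from the question, since it turns each byte into its own code point, just as the u'' literal did.

```python
def utf8_to_entities(raw):
    """Decode UTF-8 bytes, then encode to ASCII with numeric character references.

    Hypothetical helper for illustration; assumes `raw` is valid UTF-8.
    """
    text = raw.decode('utf-8')
    return text.encode('ascii', 'xmlcharrefreplace')


utf8_bytes = b'\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'

# Intended result: four characters (U+00E1), so four entities.
fixed = utf8_to_entities(utf8_bytes)

# The mistake from the question: every byte becomes its own code point
# (U+00C3 and U+00A1), so each character produces two entities.
mistaken = utf8_bytes.decode('latin-1').encode('ascii', 'xmlcharrefreplace')

print(fixed)
print(mistaken)
```

The first value holds four &#225; references; the second holds the &#195;&#161; pair repeated four times.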

Upvotes: 8
