Zacciep

Reputation: 23

Decode / encode html escaped special characters in Python

I have some text that has HTML escape codes in it that I am struggling to fully decode / encode to display properly with Python (ultimately in a Django application).

"&quot;Coup d'&Atilde;&permil;tat&quot;" being a troublesome snippet.

I have used html.unescape() to successfully unescape most of the HTML codes, but I am struggling with the decoding of the special characters, "&Atilde;&permil;", in this example. Ideally this would display as "Coup d'État", but despite trying some decoding/encoding combinations I am getting "Coup d'Ã‰tat".

What is the correct way to convert "&quot;Coup d'&Atilde;&permil;tat&quot;" into "Coup d'État"?

Thanks for your help, and apologies if this has been answered elsewhere. I've tried searching, but no success.

Upvotes: 0

Views: 1893

Answers (1)

Martijn Pieters

Reputation: 1124348

You have a Mojibake: double-encoded data. You not only have HTML entities; your data was also incorrectly decoded from bytes to text before the HTML entities were applied.

For your example, the two entities &Atilde; and &permil; decode to the Unicode characters Ã and ‰. Those two characters are named in the Unicode standard as U+00C3 LATIN CAPITAL LETTER A WITH TILDE and U+2030 PER MILLE SIGN. This is typical of UTF-8 data being misinterpreted as a Latin variant encoding (such as ISO 8859-1 or a Windows Latin codepage variant).

If we assume that the original character was meant to be É, or U+00C9 LATIN CAPITAL LETTER E WITH ACUTE, then the original would have been encoded to the bytes C3 and 89 if using UTF-8. That Ã (U+00C3!) shows up here is not a coincidence; UTF-8 -> Latin variant Mojibakes typically end up with such combinations. The 89 mapping tells us that the most likely candidate for the wrong encoding is the Windows CP 1252 encoding, which maps the hex value 89 to U+2030 PER MILLE SIGN.
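
The forward direction of that mistake can be reproduced in a few lines. This is only a sketch, assuming the wrong codec really was CP-1252; the variable names are illustrative:

```python
import html

# Illustrative reproduction, assuming the wrong codec was CP-1252.
original = "É"                           # U+00C9
utf8_bytes = original.encode("utf-8")    # b'\xc3\x89'
mojibake = utf8_bytes.decode("cp1252")   # the two bytes misread as CP-1252

# Each byte became one character: C3 -> U+00C3, 89 -> U+2030.
code_points = [f"U+{ord(ch):04X}" for ch in mojibake]

# The named HTML entities from the question decode to the same two characters.
from_entities = html.unescape("&Atilde;&permil;")
print(code_points, from_entities == mojibake)
```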

You could manually encode to bytes then decode as the correct encoding, but the trick is to know what encoding was used incorrectly. Sometimes that mistake leads to data loss, because the CP-1252 codepage doesn't have a Unicode character mapping for 5 specific byte values (hex 81, 8D, 8F, 90 and 9D). That's not a direct problem for the example in your question, but can be for other text. Manually decoding would work like this:

>>> import html
>>> broken = "&quot;Coup d'&Atilde;&permil;tat&quot;"
>>> html.unescape(broken)
'"Coup d\'Ã‰tat"'
>>> html.unescape(broken).encode("cp1252")
b'"Coup d\'\xc3\x89tat"'
>>> html.unescape(broken).encode("cp1252").decode("utf-8")
'"Coup d\'État"'
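
If you go the manual route, it may help to wrap those steps so that a wrong guess or lost data does not raise an error. The following is a minimal sketch, not a general solution: `repair_mojibake` is a hypothetical helper name, and the `cp1252` default is an assumption based on this example. The loop at the end lists the byte values that Python's `cp1252` codec leaves undefined.

```python
import html

def repair_mojibake(text, wrong_encoding="cp1252"):
    """Hypothetical helper: unescape entities, then undo a UTF-8 -> wrong_encoding mix-up."""
    unescaped = html.unescape(text)
    try:
        return unescaped.encode(wrong_encoding).decode("utf-8")
    except UnicodeError:
        # Either a character has no byte in wrong_encoding, or the resulting
        # bytes are not valid UTF-8. The guess was wrong or data was lost,
        # so return the unescaped text unchanged.
        return unescaped

fixed = repair_mojibake("&quot;Coup d'&Atilde;&permil;tat&quot;")
untouched = repair_mojibake("caf&eacute;")  # correctly decoded text is left alone

# Collect the byte values in 80..9F that Python's cp1252 codec cannot decode.
undefined = []
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(f"{value:02X}")
print(fixed, untouched, undefined)
```

Returning the text unchanged on failure is a deliberately conservative choice; it will not recover text whose Mojibake passed through one of the undefined byte values.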

A better option is to use the special ftfy library (the name stands for "fixes text for you", a play on "fixed that for you"), which uses detailed knowledge about how to recognize such mistakes and undo the damage.

ftfy also handles the HTML-entity decoding, all in one step:

>>> import ftfy
>>> ftfy.fix_text("&quot;Coup d'&Atilde;&permil;tat&quot;")
'"Coup d\'État"'

The library includes sloppy variants of the text codecs often found in a Mojibake to help with repairing. It also encodes information about how to recognize the specific errors that a given wrong codec choice produces, so it knows what to do to reverse the damage.

Upvotes: 2
