Troubles with text encoding

Question

I'm having some troubles with text encoding. Parsing a website gives me a Data.Text string

"Project - Fran\195\167ois Dubois",

which I need to write to a file. So I'm using Data.Text.Lazy.Encoding.encodeUtf8 to convert it into a ByteString. The problem is that this yields garbled output:

"Project - FranÃ§ois Dubois".

What am I missing here?

Daniel Fischer · Accepted Answer

If you have gotten Fran\195\167ois inside your Data.Text, you already have a UTF-8-encoded François.

That's inconvenient because Data.Text[.Lazy] holds decoded Unicode text (stored as UTF-16 internally before text-2.0, as UTF-8 since), so the two values 195 and 167 are interpreted as the Unicode code points 195 and 167, which are 'Ã' and '§' respectively. If you UTF-8-encode the text, these are converted to the byte sequences c3 83 ([195,131]) and c2 a7 ([194,167]) respectively.
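
To see the double encoding concretely, here is a minimal sketch using the lazy Text from the question. The names `garbled`, `intended`, `garbledBytes` and `intendedBytes` are made up for illustration; only `pack`, `encodeUtf8` and `unpack` come from the libraries.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Encoding (encodeUtf8)
import Data.Word (Word8)

-- The Text from the question: code points 195 ('Ã') and 167 ('§')
-- sit where the single code point 231 ('ç') was intended.
garbled :: TL.Text
garbled = TL.pack "Fran\195\167ois"

-- What the Text should have contained.
intended :: TL.Text
intended = TL.pack "Fran\231ois"

-- Encoding the garbled Text turns each of the two code points
-- into its own two-byte UTF-8 sequence.
garbledBytes :: [Word8]
garbledBytes = BL.unpack (encodeUtf8 garbled)
-- [70,114,97,110,195,131,194,167,111,105,115]

-- Encoding the intended Text yields the bytes the website sent.
intendedBytes :: [Word8]
intendedBytes = BL.unpack (encodeUtf8 intended)
-- [70,114,97,110,195,167,111,105,115]
```

The bytes 195 and 167 that appear as code points in `garbled` are exactly the UTF-8 encoding of 'ç' in `intendedBytes`.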

The most likely way of getting into this situation is that the data you got from the website was UTF-8 encoded, but was interpreted as ISO-8859-1 (Latin 1) encoded (or as another 8-bit encoding; 8859-15 is widespread too).

The proper way of handling it is avoiding the situation altogether [that may not be possible, unfortunately].

If the source of your data states its encoding correctly - as a website should - find out the encoding and interpret the data accordingly. If an incorrect encoding is stated, you are of course out of luck, and if no encoding is specified, you have to guess right (the natural guess nowadays is UTF-8, at least for languages using a variant of the Latin alphabet).
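
As a rough illustration of interpreting the data according to its real encoding, the sketch below decodes the same raw bytes twice, once with the wrong decoder and once with the right one. It assumes you can get at the response body as a strict ByteString; `rawBytes`, `misread` and `decoded` are hypothetical names, and the bytes are hard-coded rather than fetched.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, decodeUtf8')
import Data.Text.Encoding.Error (UnicodeException)

-- Stand-in for the bytes sent by the website: "François" in UTF-8.
rawBytes :: B.ByteString
rawBytes = B.pack [70,114,97,110,195,167,111,105,115]

-- Wrong decoder: every byte becomes one code point, so the two
-- bytes of 'ç' turn into 'Ã' and '§', as in the question.
misread :: T.Text
misread = decodeLatin1 rawBytes

-- Right decoder: decodeUtf8' returns Left on invalid input
-- instead of throwing, so a wrong guess can be handled.
decoded :: Either UnicodeException T.Text
decoded = decodeUtf8' rawBytes
```

Here `misread` equals `T.pack "Fran\195\167ois"`, while `decoded` is a `Right` holding `T.pack "Fran\231ois"`.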

If avoiding the situation is not possible, the easiest ways of fixing it are:

- replacing the occurrences of the offending sequence with the desired one before encoding:
```
encodeUtf8 $ replace (pack "Fran\195\167ois") (pack "Fran\231ois") contents
```
- assuming everything else is ASCII or inadvertent UTF-8 too, interpreting the Text code units as bytes:
```
Data.ByteString.Lazy.Char8.pack $ Data.Text.Lazy.unpack contents
```

The former is more efficient, but becomes inconvenient if there are many different misencodings (caused by different accented letters, for example). The latter works only in the assumed situation (no code units above 255 in the Text) and is rather inefficient for long texts.
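
The second approach can be wrapped in a small helper that checks its own assumption first. This is only a sketch: `repairMisdecoded` is a hypothetical name, and it assumes the Text came from a Latin-1 misreading of UTF-8 (so every character is at most 255). It returns `Nothing` when that assumption fails or when the recovered bytes are not valid UTF-8.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Encoding (decodeUtf8')

-- Hypothetical helper: undo a Latin-1 misreading of UTF-8 data.
repairMisdecoded :: TL.Text -> Maybe TL.Text
repairMisdecoded t
  -- Char8.pack silently truncates characters above 255,
  -- so refuse to continue if any are present.
  | TL.all (<= '\255') t =
      -- Reinterpret each character as one byte, then decode the
      -- bytes as UTF-8; invalid UTF-8 yields Nothing.
      either (const Nothing) Just
        (decodeUtf8' (BLC.pack (TL.unpack t)))
  | otherwise = Nothing
```

Applied to the Text from the question, this gives `Just` the Text with 'ç' restored, which can then be passed to `encodeUtf8` as usual. A Text that was decoded correctly in the first place (containing a real 'ç', for example) typically fails the UTF-8 check and comes back as `Nothing`, so the helper should not be applied blindly to mixed input.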
