Reputation: 1693
I'm having some trouble with text encoding. Parsing a website gives me a Data.Text string
"Project - Fran\195\167ois Dubois",
which I need to write to a file. So I'm using Data.Text.Lazy.Encoding.encodeUtf8 to convert it into a ByteString. The problem is that this yields garbled output:
"Project - FranÃ§ois Dubois".
What am I missing here?
Upvotes: 3
Views: 326
Reputation: 183978
If you have gotten Fran\195\167ois inside your Data.Text, you already have a UTF-8-encoded François.
That's inconvenient because Data.Text[.Lazy] holds Unicode text (stored internally as UTF-16 in versions of the text package before 2.0), so the two values 195 and 167 are interpreted as the Unicode code points 195 (U+00C3) and 167 (U+00A7), which are 'Ã' and '§' respectively. If you UTF-8-encode the text, these are converted to the byte sequences c383 ([195,131]) and c2a7 ([194,167]) respectively.
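To make the double encoding concrete, here is a minimal sketch that encodes the text from the question and lists the resulting bytes. The names misdecoded and encodedBytes are made up for illustration; only encodeUtf8 comes from the question.

```haskell
import Data.Word (Word8)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Hypothetical name: the Text as the parser produced it, holding
-- U+00C3 and U+00A7 where a single U+00E7 was intended.
misdecoded :: TL.Text
misdecoded = TL.pack "Fran\195\167ois"

-- Hypothetical name: the bytes obtained by UTF-8 encoding that Text.
encodedBytes :: [Word8]
encodedBytes = BL.unpack (TLE.encodeUtf8 misdecoded)

main :: IO ()
main = print encodedBytes
-- prints [70,114,97,110,195,131,194,167,111,105,115]
```

The four bytes 195,131,194,167 in the middle are what a UTF-8 viewer renders as the two characters 'Ã' and '§'.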
The most likely cause is that the data you got from the website was UTF-8 encoded but was interpreted as ISO-8859-1 (Latin 1) encoded, or as another 8-bit encoding (8859-15 is widespread too).
The proper way of handling it is avoiding the situation altogether [that may not be possible, unfortunately].
If the source of your data states its encoding correctly - as a website should - find out the encoding and interpret the data accordingly. If an incorrect encoding is stated, you are of course out of luck, and if no encoding is specified, you have to guess right (the natural guess nowadays is UTF-8, at least for languages using a variant of the Latin alphabet).
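As a rough illustration of how much the choice of decoder matters, the sketch below decodes the same bytes once as Latin 1 and once as UTF-8. It assumes you can get at the raw bytes of the page as a strict ByteString; rawBytes, asLatin1 and asUtf8 are hypothetical names, and how your HTTP or parsing library exposes the bytes and the declared charset is not covered here.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- Hypothetical input: the UTF-8 bytes of "François" as sent by the server.
rawBytes :: B.ByteString
rawBytes = B.pack [70,114,97,110,195,167,111,105,115]

-- Wrong guess: every byte becomes one code point, giving "Fran\195\167ois".
asLatin1 :: T.Text
asLatin1 = TE.decodeLatin1 rawBytes

-- Right guess: the two bytes 195,167 become the single code point 231.
-- decodeUtf8' reports invalid input as a Left instead of throwing.
asUtf8 :: Either UnicodeException T.Text
asUtf8 = TE.decodeUtf8' rawBytes

main :: IO ()
main = do
  print asLatin1  -- prints "Fran\195\167ois"
  print asUtf8    -- prints Right "Fran\231ois"
```

The first result is exactly the string from the question, which supports the guess that the page was decoded as Latin 1 somewhere along the way.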
If avoiding the situation is not possible, the easiest ways of fixing it are:
1. Replacing the occurrences of the offending sequence with the desired one before encoding:
   encodeUtf8 $ replace (pack "Fran\195\167ois") (pack "Fran\231ois") contents
2. Assuming everything else is ASCII or inadvertent UTF-8 too, interpreting the Text code units as bytes:
   Data.ByteString.Lazy.Char8.pack $ Data.Text.Lazy.unpack contents
The former is more efficient, but becomes inconvenient if there are many different misencodings (caused by different accented letters, for example). The latter works only in the assumed situation (no code units above 255 in the Text) and is rather inefficient for long texts.
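The second approach can be wrapped with a check of its assumption. This is only a sketch: recoverUtf8Bytes is a hypothetical helper, and the extra validation (rejecting characters above 255 and byte sequences that are not valid UTF-8) is an addition for safety, not something the approach itself requires.

```haskell
import Data.Char (ord)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Hypothetical helper: treat every character of the Text as one byte.
-- Returns Nothing if a character does not fit into a byte, or if the
-- resulting bytes are not valid UTF-8 (so the text was not misdecoded
-- UTF-8 in the first place).
recoverUtf8Bytes :: TL.Text -> Maybe BL.ByteString
recoverUtf8Bytes t
  | TL.any ((> 255) . ord) t = Nothing
  | otherwise =
      let bytes = BLC.pack (TL.unpack t)
      in case TLE.decodeUtf8' bytes of
           Left _  -> Nothing
           Right _ -> Just bytes

main :: IO ()
main = print (fmap BL.unpack (recoverUtf8Bytes (TL.pack "Fran\195\167ois")))
-- prints Just [70,114,97,110,195,167,111,105,115]
```

A Just result can be written to the file as is, for example with Data.ByteString.Lazy.writeFile. A Nothing result means the assumption does not hold for that input and the text should be handled some other way.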
Upvotes: 5
Reputation: 1903
I am not completely sure if less can show UTF-8 encoded characters properly. GVim can. You can check this link on SO to find out how you can view UTF-8 data in gVim.
And regarding the other issue of being able to pass this to graphviz, I think you need to set the encoding on the command-line as explained in the Graph NonAscii FAQ.
From what you are explaining, I think there are no issues with how the data is being persisted. If you pass the encoding properly to graphviz, I think your problem will be resolved.
P.S: Creating an answer since it is easier to create descriptive links
Upvotes: 0