mrsteve

Reputation: 4142

ISO-8859-1 special characters error in query result

I use the hsparql library to run a query that returns German text, so ISO-8859-1 special characters come back in the results.

I wrote the result of the query to a file using writeFile, but the special characters are not displayed correctly (when viewing the file with Emacs).

When I instead write the output of the show function to a file, I get the following output:

["B\195\188ro", ...]

With the special character printed out properly, this would read: ["Büro", ...]

How can I write special characters correctly to a file? (e.g. so that "Büro" is shown correctly in the file output.)

EDIT: I know that show writes the escaped characters. Using writeFile directly doesn't work; I have to check the link given in hammar's answer to find a fix.

EDIT2: removed, was the wrong approach.

EDIT3: hammar's answer was right on point. It took only ten minutes to find the solution, but I had to be focused and concentrated.

I looked up the System.IO documentation that hammar linked to.

The solution was (literate Haskell):

> import System.IO

> writeAllLabels = do

Running my query (not shown; it accesses the RDF triple store):

>             res <- selectStr33 (unlines qAllLabels)

>             outh <- openFile "/tmp/haskell_output.txt" WriteMode

This is the important line: if I wrote utf8 here instead of latin1, I would get the wrong result again, i.e. the same as before I asked the question.

>             hSetEncoding outh latin1

>             hPutStrLn outh res
>             hClose outh
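
The same write can also be done with withFile, which closes the handle even if an exception is thrown. A variant of the solution above (writeAllLabels2 is just a name I picked; it takes the query result as an argument and uses the System.IO import above):

> writeAllLabels2 :: String -> IO ()
> writeAllLabels2 res =
>     withFile "/tmp/haskell_output.txt" WriteMode $ \outh -> do
>         hSetEncoding outh latin1
>         hPutStrLn outh res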

Upvotes: 1

Views: 301

Answers (2)

Daniel Fischer

Reputation: 183978

It looks as if either your database sends a UTF-8 encoded string that is mistakenly treated as latin1 encoded, so it gets encoded again, or the database sends UTF-8 and your locale is latin1 (or another single-byte encoding), or perhaps UCS-2/UTF-16 (if you're on Windows, it's probably the latter).

The character 'ü' is code point 252, its latin1 encoding is the byte 252 (\xFC), the UTF-8 encoding is the two-byte sequence [195,188] ([\xC3,\xBC]).
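
You can check those numbers yourself; a small sketch, assuming the bytestring and text packages (my choice of libraries, the answer itself doesn't name any):

    import qualified Data.ByteString as BS
    import qualified Data.ByteString.Char8 as BC
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      print (fromEnum 'ü')                            -- 252, the code point
      -- Char8.pack keeps only the low 8 bits of each character,
      -- which for latin1 code points is exactly the latin1 byte:
      print (BS.unpack (BC.pack "ü"))                 -- [252]
      print (BS.unpack (TE.encodeUtf8 (T.pack "ü")))  -- [195,188]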

If the database sends UTF-8 and your locale is latin1, the two-byte sequence is interpreted as the two characters Ã¼ and would show as such in emacs (if the used font has the glyphs), and as "\195\188" when using show in ghci.

If the database sends UTF-8 believed to be latin1, and that is converted to UTF-8, the two bytes would be transformed into two two-byte sequences, [195,131] ([\xC3,\x83]) and [194,188] ([\xC2,\xBC]), which would in a UTF-8 locale be interpreted as the two characters Ã¼ again.

If the database sends latin1 believed to be UTF-8, the byte sequence [252,114] ([\xFC,\x72]) arising from "ür" would be an illegal byte sequence leading to an encoding error. I'm not aware of any error handling mechanism that would transform the offending 252 into [195,188], so that's unlikely to be what's happening.
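
All three scenarios can be reproduced in a few lines; a sketch using the text and bytestring packages (again my choice, nothing above depends on them specifically):

    import qualified Data.ByteString as BS
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      let utf8Bytes = TE.encodeUtf8 (T.pack "ü")   -- the bytes [195,188]
      -- UTF-8 bytes read as latin1: the two characters Ã¼
      print (TE.decodeLatin1 utf8Bytes)            -- "\195\188"
      -- ...and that mistaken text re-encoded as UTF-8: four bytes
      print (BS.unpack (TE.encodeUtf8 (TE.decodeLatin1 utf8Bytes)))
                                                   -- [195,131,194,188]
      -- latin1 bytes for "ür" are not valid UTF-8: decoding fails
      print (TE.decodeUtf8' (BS.pack [252,114]))   -- Left (Cannot decode ...)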

To find out what's happening, look at the file in a hex editor (or use xxd if on a unixish platform) and check your locale. The solution to your problem should be setting the handles to the correct encoding, as implied by the part of the documentation @hammar linked to.
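
To see what GHC thinks your locale encoding is, something like this should do (a sketch; localeEncoding and hGetEncoding live in System.IO):

    import System.IO (localeEncoding, hGetEncoding, stdout)

    main :: IO ()
    main = do
      print localeEncoding           -- e.g. UTF-8
      hGetEncoding stdout >>= print  -- e.g. Just UTF-8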

Upvotes: 2

hammar

Reputation: 139930

Don't use show if you don't want things escaped. It's meant for lightweight serialization and will escape a number of special characters as well as characters outside the ASCII range. If you use writeFile directly, it should work with the default encoding for your current locale.
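
For instance, a minimal sketch of the difference (the file names are made up):

    main :: IO ()
    main = do
      writeFile "direct.txt" "Büro"        -- locale encoding; reads as Büro
      writeFile "shown.txt" (show "Büro")  -- escaped: the file contains "B\252ro"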

For more fine-grained control over encodings, see the System.IO documentation.

Upvotes: 4
