atlantis

Reputation: 3126

Character encoding issue in GHC

When I try to read a plaintext file from within my Haskell program, I get:

[fromList *** Exception: /path/to/file/aaa.txt: hGetContents: invalid argument (Invalid or incomplete multibyte or wide character)

Googling suggests this problem is usually fixed by setting LANG to en_US.UTF-8, but my locale is already set that way.

Not sure if this is an issue with GHC at all.

I am on Ubuntu 11.10

Upvotes: 3

Views: 684

Answers (1)

ehird

Reputation: 40787

Are you sure aaa.txt contains valid UTF-8? If it's binary data, you need to use withBinaryFile or similar. If it is text in another encoding, you should use hSetEncoding.

For instance, if your text is in Latin-1 then you would say

hSetEncoding h latin1

where "h" is your file handle. If you are reading from standard input, then it's

hSetEncoding stdin latin1
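To put the calls above in context, here is a minimal sketch of reading a whole file with an explicit encoding. The helper name readFileWith and the path "aaa.txt" are placeholders of mine, not something from the question; only the System.IO functions are real.

```haskell
import System.IO

-- Hypothetical helper: read a whole file as text, decoding it with the
-- given encoding instead of the one implied by the locale.
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc
  contents <- hGetContents h
  -- hGetContents is lazy, so force the whole string before closing.
  length contents `seq` hClose h
  return contents

main :: IO ()
main = do
  -- "aaa.txt" is a placeholder path.
  text <- readFileWith latin1 "aaa.txt"
  putStr text
```

Every byte is a valid Latin-1 character, so this cannot fail with a decoding error, but it will show the wrong characters if the file is really in some other encoding.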

There is also a mkTextEncoding function which you can use if you have read the encoding from metadata, or wish to customise the handling of invalid Unicode (although this only works on some systems).
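As a rough sketch of the mkTextEncoding route: the function takes an encoding name, optionally followed by an iconv-style suffix such as //IGNORE that asks the decoder to skip invalid sequences. The helper name readFileNamed and the file path are placeholders, and whether a given name or suffix is accepted depends on the platform and GHC version.

```haskell
import System.IO

-- Hypothetical helper: look up an encoding by name and read a file with it.
-- mkTextEncoding throws an IOError if the name is not recognised.
readFileNamed :: String -> FilePath -> IO String
readFileNamed encName path = do
  enc <- mkTextEncoding encName
  h <- openFile path ReadMode
  hSetEncoding h enc
  contents <- hGetContents h
  -- Force the lazy string before closing the handle.
  length contents `seq` hClose h
  return contents

main :: IO ()
main = do
  -- "UTF-8//IGNORE" drops byte sequences that are not valid UTF-8
  -- instead of raising an exception. "aaa.txt" is a placeholder path.
  text <- readFileNamed "UTF-8//IGNORE" "aaa.txt"
  putStr text
```

Silently dropping bytes hides the underlying problem, so this is better suited to diagnosing a file than to production use.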

The Unicode standards say that a Unicode parser should reject invalid strings with an error, rather than trying to fix them up. This is a deliberate rejection of Postel's Law, on the grounds of reducing security holes and inconsistent interpretations.

(You might want to consider the text library if you'll be working with a lot of text and handling encoding issues. It's usually a lot faster than String because it stores text in an unboxed array rather than a linked list. The trade-off is that the default Data.Text type and its operations are strict, although a lazy variant, Data.Text.Lazy, is also available. It also lets you configure how to respond to invalid Unicode more portably and flexibly.)
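A minimal sketch of the text approach, assuming the bytestring and text packages are installed: read the raw bytes, then choose explicitly between strict decoding (which reports invalid UTF-8 as a value) and lenient decoding (which substitutes U+FFFD). The names decodeStrict and decodeLenient and the file path are placeholders of mine.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.Encoding (decodeUtf8', decodeUtf8With)
import Data.Text.Encoding.Error (UnicodeException, lenientDecode)

-- Strict: invalid UTF-8 is reported as a Left value, not an exception.
decodeStrict :: B.ByteString -> Either UnicodeException T.Text
decodeStrict = decodeUtf8'

-- Lenient: each invalid sequence is replaced with U+FFFD.
decodeLenient :: B.ByteString -> T.Text
decodeLenient = decodeUtf8With lenientDecode

main :: IO ()
main = do
  -- "aaa.txt" is a placeholder path.
  bytes <- B.readFile "aaa.txt"
  case decodeStrict bytes of
    Right txt -> TIO.putStr txt
    Left err  -> do
      putStrLn ("Not valid UTF-8: " ++ show err)
      TIO.putStr (decodeLenient bytes)
```

Because the decoding happens in pure code on a ByteString, the result does not depend on the locale or on handle settings.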

Upvotes: 4
