Reputation: 3126
When I try to read a plaintext file from within my Haskell program, I get:
[fromList * Exception: /path/to/file/aaa.txt hGetContents: invalid argument (Invalid or incomplete multibyte or wide character)
I googled and found that this problem is usually fixed by setting LANG to en_US.UTF-8. That is already my locale setting.
Not sure if this is an issue with GHC at all.
I am on Ubuntu 11.10.
Upvotes: 3
Views: 684
Reputation: 40787
Are you sure aaa.txt contains valid UTF-8? If it's binary data, you need to use withBinaryFile or similar. If it is text in another encoding, you should use hSetEncoding.
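One way to check the first question is to read the raw bytes and ask the text library whether they decode as UTF-8. This is only a sketch: the helper name isValidUtf8File is made up for illustration, and it assumes the bytestring and text packages are available.

```haskell
import System.IO (withBinaryFile, IOMode(ReadMode))
import qualified Data.ByteString as B
import Data.Text.Encoding (decodeUtf8')

-- Hypothetical helper (not part of any library): True when the
-- file's raw bytes decode cleanly as UTF-8, False otherwise.
isValidUtf8File :: FilePath -> IO Bool
isValidUtf8File path = do
  -- Binary mode: no text decoding happens while reading.
  bytes <- withBinaryFile path ReadMode B.hGetContents
  -- decodeUtf8' returns Left on the first invalid sequence.
  return (either (const False) (const True) (decodeUtf8' bytes))
```

If this returns False for aaa.txt, the locale is not the problem; the file itself is in some other encoding or is not text at all.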
For instance, if your text is in Latin-1 then you would say
hSetEncoding h latin1
where "h" is your file handle. If you are reading from standard input, then it's
hSetEncoding stdin latin1
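Put together, reading a whole Latin-1 file might look roughly like this. The helper name readFileLatin1 is hypothetical; the calls themselves (withFile, hSetEncoding, latin1, hGetContents) are from System.IO.

```haskell
import System.IO

-- Hypothetical helper: read a whole file, decoding its bytes as Latin-1.
readFileLatin1 :: FilePath -> IO String
readFileLatin1 path =
  withFile path ReadMode $ \h -> do
    -- Set the encoding before any data is read from the handle.
    hSetEncoding h latin1
    contents <- hGetContents h
    -- hGetContents is lazy; force the whole string before
    -- withFile closes the handle.
    length contents `seq` return contents
```

Every byte is a valid Latin-1 character, so this never raises a decoding error, but it will produce garbage if the file is actually in some other encoding.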
There is also a mkTextEncoding function which you can use if you have read the encoding from metadata, or wish to customise the handling of invalid Unicode (although this only works on some systems).
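As a rough sketch of the mkTextEncoding route: the "//IGNORE" suffix asks for invalid input sequences to be skipped rather than reported. Whether a given encoding name and suffix are accepted depends on the system and GHC version, so treat the string "UTF-8//IGNORE" as an assumption to verify locally. The helper name readFileIgnoringInvalid is made up for this example.

```haskell
import System.IO

-- Hypothetical helper: read a file as UTF-8, dropping invalid
-- byte sequences instead of throwing an exception.
readFileIgnoringInvalid :: FilePath -> IO String
readFileIgnoringInvalid path = do
  -- Assumption: this encoding name and suffix are supported here.
  enc <- mkTextEncoding "UTF-8//IGNORE"
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    contents <- hGetContents h
    -- Force the lazy string before withFile closes the handle.
    length contents `seq` return contents
```

Silently dropping bytes hides data corruption, so this is best kept for cases where losing the bad characters is acceptable.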
The Unicode standards say that a Unicode parser should reject invalid strings with an error, rather than trying to fix them up. This is a deliberate rejection of Postel's Law, on the grounds of reducing security holes and inconsistent interpretations.
(You might want to consider using the text library if you'll be working with a lot of text and having to handle encoding issues; it's usually a lot faster than using Strings, since it uses an unboxed array rather than a linked list, although this means that Text values and operations on them are necessarily strict. It also lets you configure how to respond to invalid Unicode more portably and flexibly.)
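For the text library route, a minimal sketch could read the bytes strictly and decode them with an explicit error handler. Here lenientDecode substitutes U+FFFD for invalid input; the helper name readFileReplacing is hypothetical, and the bytestring and text packages are assumed to be installed.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical helper: read a file as UTF-8 Text, replacing
-- invalid input with U+FFFD instead of throwing.
readFileReplacing :: FilePath -> IO T.Text
readFileReplacing path = do
  bytes <- B.readFile path          -- raw bytes, no decoding yet
  return (decodeUtf8With lenientDecode bytes)
```

Swapping lenientDecode for strictDecode, or for a custom handler, changes how invalid input is treated without touching the file-reading code.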
Upvotes: 4