thkang

Reputation: 11543

What is the proper way to handle string encoding in Haskell?

I'm on Windows with code page 949, and Excel and Notepad.exe will happily save files in cp949 encoding.

In Python, dealing with them isn't a pain, thanks to str.encode and str.decode.

Recently I've discovered Haskell, and it seems there is more than one way to manipulate strings. Real World Haskell tells me to use ByteString for efficient IO, but I don't see a way to switch between the encodings I use.

I have to read files that are not in UTF-8 encoding, and write them back in their original encoding. Most of them will be cp949.

Internally my Haskell source will be in UTF-8.

It wasn't that hard in Python, with the principle of str for IO and unicode for processing, but Haskell even lacks built-in cp949 support.

So the question is: how do I do IO over files in various encodings? I have to read, convert, process, and write them.


Edit:

I tried both options, and it seems the state of text conversion on Windows is abysmal.

text-icu

pros:

cons:

iconv

pros:

cons:

Upvotes: 4

Views: 1110

Answers (2)

Danny Navarro

Reputation: 2763

You can use the Convert module of the text-icu package for encodings not directly supported by text.

Assuming you already got the encoded ByteString, you would do something like this:

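import Data.ByteString (ByteString)
import Data.Text (Text)
import qualified Data.Text.ICU.Convert as Convert

-- Decode cp949 bytes into Text; a converter is opened on every call.
decodeCP949 :: ByteString -> IO Text
decodeCP949 bs = do
    conv <- Convert.open "cp949" Nothing
    return $ Convert.toUnicode conv bs

-- Encode Text back into cp949 bytes.
encodeCP949 :: Text -> IO ByteString
encodeCP949 t = do
    conv <- Convert.open "cp949" Nothing
    return $ Convert.fromUnicode conv t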

The IO here is a bit annoying. I think this is a case where using unsafePerformIO to obtain the converter once would be alright.
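A minimal sketch of that idea, assuming the text-icu calls used above (Convert.open, Convert.toUnicode, Convert.fromUnicode). The names cp949Converter, decodeCP949Pure and encodeCP949Pure are hypothetical, made up for this illustration. Whether a single shared converter is safe to use from several threads at once is not something I have verified, so treat this as suitable for single-threaded use only.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.ICU.Convert as Convert
import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical top-level converter, opened once on first use.
-- NOINLINE keeps GHC from inlining the definition and opening it repeatedly.
cp949Converter :: Convert.Converter
cp949Converter = unsafePerformIO (Convert.open "cp949" Nothing)
{-# NOINLINE cp949Converter #-}

-- Pure decode from cp949 bytes to Text using the shared converter.
decodeCP949Pure :: B.ByteString -> T.Text
decodeCP949Pure = Convert.toUnicode cp949Converter

-- Pure encode from Text to cp949 bytes using the shared converter.
encodeCP949Pure :: T.Text -> B.ByteString
encodeCP949Pure = Convert.fromUnicode cp949Converter
```

With this, reading a file could look like `decodeCP949Pure <$> B.readFile path`, and the processing code stays free of IO.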

Upvotes: 4

ErikR

Reputation: 52057

You can use the Codec.Text.IConv module in the iconv package:

http://hackage.haskell.org/package/iconv-0.4.1.2/docs/Codec-Text-IConv.html

The convert function converts from one encoding to another, so you can convert a CP949 ByteString to a UTF-8 ByteString (and then to Text if you want).

You can also reverse the process (Text -> UTF-8 ByteString -> CP949 ByteString).

Here is some example code I found on github:

https://github.com/wookay/da/blob/master/haskell/fun/test_encode.hs
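The conversion chain described above can be sketched roughly as follows. This assumes that Codec.Text.IConv.convert works on lazy ByteStrings and that the underlying iconv implementation recognises the encoding name "CP949"; the available names depend on the platform's iconv. The helper rewriteCP949File is hypothetical and not part of the package.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- CP949 ByteString -> UTF-8 ByteString -> Text
decodeCP949 :: BL.ByteString -> TL.Text
decodeCP949 = TLE.decodeUtf8 . IConv.convert "CP949" "UTF-8"

-- Text -> UTF-8 ByteString -> CP949 ByteString
encodeCP949 :: TL.Text -> BL.ByteString
encodeCP949 = IConv.convert "UTF-8" "CP949" . TLE.encodeUtf8

-- Hypothetical helper: read a CP949 file, process it as Text,
-- and write the result to a different file in CP949 again.
-- src and dst should differ, because the input is read lazily.
rewriteCP949File :: (TL.Text -> TL.Text) -> FilePath -> FilePath -> IO ()
rewriteCP949File process src dst = do
    bytes <- BL.readFile src
    BL.writeFile dst (encodeCP949 (process (decodeCP949 bytes)))
```

Note that convert raises an exception on input that is invalid in the source encoding or cannot be represented in the target encoding; the same module also offers convertFuzzy and convertStrictly if you need to control that behaviour.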

Upvotes: 4
