Rehno Lindeque
Rehno Lindeque

Reputation: 4435

Avoid backslash-encoding utf8 characters when deriving Read (and Show) in Haskell

I'm having trouble parsing utf8 characters into Text when deriving a Read instance. For example, when I run the following in ghci...

> import Data.Text
> data Message = Message Text deriving (Read, Show)
> read ("Message \"→\"") :: Message
Message "\8594"

Can I do anything to keep my text inside Message utf-8 encoded? I.e. The result should be...

Message "→"

(P.S. I already receive my serialized messages as Text, but currently need to unpack to a String in order to call read. I'd love to avoid this...)

EDIT: Ah sorry, answers rightly point out that it's show not read which converts to "\8594" - is there a way to show and convert back to Text again without the backslash encoding?

Upvotes: 2

Views: 2947

Answers (1)

C. A. McCann
C. A. McCann

Reputation: 77374

To the best of my knowledge, the internal encoding used by Text (which is actually UTF-16) is consistent and not exposed directly. If you want UTF-8, you can decode/encode a Text value as appropriate. Similarly, it doesn't make sense to talk about an encoding for String, because that's just a list of Char, where each Char is a unicode code point.

Most likely, it's only the Show instance for Text displaying things differently here.

Also, keep in mind that (by consistent convention in standard libraries) read and show are expected to behave as (de-)serialization functions, with a "serialized" format that, interpreted as a Haskell expression, describes a value equivalent to the one being (de-)serialized. As such, the slash encoding with ASCII text is often preferred for being widely supported and unambiguous. If you want to display a Text value with the actual code points, show isn't what you want.


I'm not entirely clear on what you want to do with the Text--using show directly is exactly what you're trying to avoid. If you want to display text in a terminal window that's going to dictate the encoding, and you want the stuff defined in Data.Text.IO. If you need to convert to a specific encoding for whatever other reason, Data.Text.Encoding will give you an encoded ByteString (emphasis on "byte", not "string"--a ByteString is a sequence of raw bytes, not a string of characters).

If you just want to convert from Text to String and back to Text... what's wrong with the slash encoding? show is not really intended for pretty-printing output for users to read, despite many people's initial expectations otherwise.

Upvotes: 5

Related Questions