Jogger

Reputation: 1699

Serializing a Data.Text value to a ByteString without unnecessary \NUL bytes

With the following code, I want to serialize a Data.Text value to a ByteString. Unfortunately my text is prepended with unnecessary NUL bytes and an EOT byte:

GHCi, version 9.4.4: https://www.haskell.org/ghc/  :? for help
ghci> import qualified Data.Text as T
ghci> import Data.Binary
ghci> import Data.Binary.Put
ghci> let txt = T.pack "Text"
ghci> runPut $ put txt
"\NUL\NUL\NUL\NUL\NUL\NUL\NUL\EOTText"
ghci>

Questions:

PS: In the real code I put the length in front of the text:

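    foo :: Text -> ByteString
    foo txt = runPut $ do
        -- T.length returns an Int, so convert it to the Word32 that putWord32host expects
        putWord32host $ fromIntegral (T.length txt)
        put txt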

Upvotes: 6

Views: 84

Answers (1)

willeM_ Van Onsem

Reputation: 476493

The Binary instance of Text already encodes the length in the binary string. If we look at the source code for the Text instance of Binary, we see [src]:

instance Binary Text where
    put t = put (encodeUtf8 t)
    get   = do
      bs <- get
      case decodeUtf8' bs of
        P.Left exn -> P.fail (P.show exn)
        P.Right a -> P.return a

That's not much of a surprise: the Text is encoded to UTF-8, which produces a ByteString, and put is then called on that ByteString. The length is added when the ByteString itself is put. Indeed, the ByteString instance of Binary looks like [src]:

instance Binary B.ByteString where
    put bs = put (B.length bs)
             <> putByteString bs
    get    = get >>= getByteString

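The put for the ByteString produced by encodeUtf8 thus first writes eight bytes that specify the size of the ByteString (B.length returns an Int, which Binary serializes as a 64-bit big-endian value), followed by the bytes themselves. That is why "Text" is preceded by seven \NUL bytes and one \EOT byte (0x04). Note that this is the number of bytes, which is not necessarily the same as the number of characters in the Text.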
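To make the difference between the byte count and the character count concrete, here is a small sketch. The helper names encodedBytes and charAndByteCount are made up for this illustration and are not part of binary or text; the sample string "caf\233" ("café") is likewise just an example.

```haskell
import Data.Binary (encode)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)
import Data.Word (Word8)

-- Hypothetical helper: the raw bytes produced by the Binary instance of Text
-- (8-byte length prefix followed by the UTF-8 payload).
encodedBytes :: T.Text -> [Word8]
encodedBytes = BL.unpack . encode

-- Hypothetical helper: (number of characters, number of UTF-8 bytes).
charAndByteCount :: T.Text -> (Int, Int)
charAndByteCount t = (T.length t, B.length (encodeUtf8 t))
```

For T.pack "Text" both counts are 4 and the encoded bytes are [0,0,0,0,0,0,0,4,84,101,120,116]. For T.pack "caf\233" the counts are (4, 5), because é takes two bytes in UTF-8, and the prefix ends in 5 rather than 4.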

If you want the same output, but without the length prefix, you can use:

import Data.Text.Encoding

runPut (putByteString (encodeUtf8 txt))

This omits the length header.
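If you still want your own, shorter length prefix as in the PS of the question, one possible sketch is shown here. It assumes a 32-bit big-endian prefix that counts UTF-8 bytes rather than characters, so the reader knows exactly how many bytes to consume. The names putTextWithLen32 and getTextWithLen32 are hypothetical, and putWord32be is used instead of the question's putWord32host only to keep the output independent of the machine's byte order.

```haskell
import Data.Binary.Get (getByteString, getWord32be, runGet)
import Data.Binary.Put (putByteString, putWord32be, runPut)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, encodeUtf8)

-- Hypothetical helper: 4-byte big-endian byte count, then the UTF-8 payload.
putTextWithLen32 :: T.Text -> BL.ByteString
putTextWithLen32 txt = runPut $ do
    let bytes = encodeUtf8 txt
    putWord32be (fromIntegral (B.length bytes))
    putByteString bytes

-- Hypothetical inverse: read the byte count, then decode that many bytes.
-- decodeUtf8 throws on invalid UTF-8; decodeUtf8' returns an Either instead.
getTextWithLen32 :: BL.ByteString -> T.Text
getTextWithLen32 = runGet $ do
    n <- getWord32be
    decodeUtf8 <$> getByteString (fromIntegral n)
```

With this sketch, putTextWithLen32 (T.pack "Text") yields "\NUL\NUL\NUL\EOTText", and getTextWithLen32 reads it back as the original Text.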

Upvotes: 5
