Reputation: 4175
Good day.
The one thing I now hate about Haskell is the number of packages for working with strings.
First I used native Haskell [Char] strings, but when I started using Hackage libraries I got completely lost in endless conversions. Every package seems to use a different string implementation, and some adopt their own handmade thing.
Next I rewrote my code with Data.Text strings and the OverloadedStrings extension. I chose Text because it has a wider set of functions, but it seems many projects prefer ByteString.
Could someone give a short explanation of why to use one or the other?
PS: By the way, how do I convert from Text to ByteString?
Couldn't match expected type Data.ByteString.Lazy.Internal.ByteString
       against inferred type Text
  Expected type: IO Data.ByteString.Lazy.Internal.ByteString
  Inferred type: IO Text
I tried encodeUtf8 from Data.Text.Encoding, but no luck:
Couldn't match expected type Data.ByteString.Lazy.Internal.ByteString against inferred type Data.ByteString.Internal.ByteString
UPD:
Thanks for the responses. That *Chunks goodness looks like the way to go, but I am somewhat shocked by the result. My original function looked like this:
htmlToItems :: Text -> [Item]
htmlToItems =
  getItems . parseTags . convertFuzzy Discard "CP1251" "UTF8"
And now became:
htmlToItems :: Text -> [Item]
htmlToItems =
  getItems . parseTags . fromLazyBS . convertFuzzy Discard "CP1251" "UTF8" . toLazyBS
  where
    toLazyBS t = fromChunks [encodeUtf8 t]
    fromLazyBS t = decodeUtf8 $ intercalate "" $ toChunks t
And yes, this function does not work because it is wrong: if we supply Text to it, we are already confident that the text is properly decoded and ready to use, so converting it again is a stupid thing to do. Still, such a verbose conversion has to take place somewhere outside htmlToItems.
Upvotes: 80
Views: 15861
Reputation: 23038
For what it's worth, I found these two helper functions to be quite useful:
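{-# LANGUAGE OverloadedStrings #-}  -- needed for the string literals in the example
import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T
-- | Text to ByteString
-- Note: Char8 'pack' keeps only the low 8 bits of each Char,
-- so this is only safe for ASCII / Latin-1 text.
tbs :: T.Text -> BS.ByteString
tbs = BS.pack . T.unpack
-- | ByteString to Text
-- Note: each byte becomes one Char; UTF-8 input is not decoded.
bst :: BS.ByteString -> T.Text
bst = T.pack . BS.unpack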
Example:
foo :: [BS.ByteString]
foo = ["hello", "world"]
bar :: [T.Text]
bar = bst <$> foo
baz :: [BS.ByteString]
baz = tbs <$> bar
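One caveat about the helpers above: Data.ByteString.Char8.pack keeps only the low 8 bits of each Char, so the round trip is only safe for ASCII or Latin-1 text. Below is a minimal sketch of UTF-8 based alternatives, assuming UTF-8 is the encoding you want on the byte side. The names tbsUtf8, bstUtf8 and tbsChar8 are made up here for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Text to ByteString, encoded as UTF-8 (hypothetical name).
tbsUtf8 :: T.Text -> B.ByteString
tbsUtf8 = TE.encodeUtf8

-- | ByteString to Text, decoded as UTF-8 (hypothetical name).
-- 'decodeUtf8' throws an exception on invalid UTF-8 input.
bstUtf8 :: B.ByteString -> T.Text
bstUtf8 = TE.decodeUtf8

-- | The Char8-based conversion, kept here only for comparison.
tbsChar8 :: T.Text -> BS.ByteString
tbsChar8 = BS.pack . T.unpack
```

For the single character U+0434 (written "\1076" in a Haskell string), tbsChar8 yields the one byte 0x34, which is ASCII '4', whereas tbsUtf8 yields the two bytes 0xD0 0xB4 and bstUtf8 recovers the original text from them.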
Upvotes: 1
Reputation: 24802
ByteStrings are mainly useful for binary data, but they are also an efficient way to process text if all you need is the ASCII character set. If you need to handle Unicode strings, you need to use Text. However, I must emphasize that neither is a replacement for the other, and they are generally used for different things: while Text represents pure Unicode, you still need to encode to and from a binary ByteString representation whenever you, for example, transport text via a socket or a file.
Here is a good article about the basics of Unicode, which does a decent job of explaining the relation between Unicode code points (Text) and the encoded binary bytes (ByteString): The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
You can use the Data.Text.Encoding module to convert between the two datatypes, or Data.Text.Lazy.Encoding if you are using the lazy variants (as you seem to be doing based on your error messages).
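As a rough sketch of what that looks like in practice, assuming UTF-8 is the encoding on the byte side: the wrapper names below (encodeStrict, decodeStrict, encodeLazy, decodeLazy, textToLazyBS) are invented for this example, while the functions they call come from the text package.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Strict Text -> strict ByteString.
encodeStrict :: T.Text -> B.ByteString
encodeStrict = TE.encodeUtf8

-- Strict ByteString -> strict Text.
-- decodeUtf8' returns a Left instead of throwing on invalid UTF-8.
decodeStrict :: B.ByteString -> Either TEE.UnicodeException T.Text
decodeStrict = TE.decodeUtf8'

-- Lazy Text -> lazy ByteString.
encodeLazy :: TL.Text -> BL.ByteString
encodeLazy = TLE.encodeUtf8

-- Lazy ByteString -> lazy Text, again without throwing.
decodeLazy :: BL.ByteString -> Either TEE.UnicodeException TL.Text
decodeLazy = TLE.decodeUtf8'

-- Strict Text -> lazy ByteString, which is the combination the
-- error message in the question asks for.
textToLazyBS :: T.Text -> BL.ByteString
textToLazyBS = TLE.encodeUtf8 . TL.fromStrict
```

The decoding direction can fail because not every byte sequence is valid UTF-8, which is why the sketch uses the Either-returning variants rather than the plain decodeUtf8.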
Upvotes: 69
Reputation: 1008
Use the single function cs from the Data.String.Conversions module.
It lets you convert between String, ByteString and Text (as well as ByteString.Lazy and Text.Lazy), depending on the input and the expected types.
You still have to call it, but you no longer have to worry about the respective types.
See this answer for a usage example.
Upvotes: 7
Reputation: 28097
You definitely want to be using Data.Text for textual data.
encodeUtf8 is the way to go. This error:
Couldn't match expected type Data.ByteString.Lazy.Internal.ByteString against inferred type Data.ByteString.Internal.ByteString
means that you're supplying a strict bytestring to code which expects a lazy bytestring. Conversion is easy with the fromChunks function:
Data.ByteString.Lazy.fromChunks :: [Data.ByteString.Internal.ByteString] -> ByteString
so all you need to do is add the function fromChunks [myStrictByteString] wherever the lazy bytestring is expected.
Conversion the other way can be accomplished with the dual function toChunks, which takes a lazy bytestring and gives a list of strict chunks.
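A small sketch of both directions, with helper names (lazyFromStrict, strictFromLazy) chosen only for this example:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- Strict -> lazy: wrap the strict bytestring as a single chunk.
lazyFromStrict :: B.ByteString -> BL.ByteString
lazyFromStrict strict = BL.fromChunks [strict]

-- Lazy -> strict: concatenate all chunks into one strict bytestring.
-- This forces the whole lazy bytestring into memory.
strictFromLazy :: BL.ByteString -> B.ByteString
strictFromLazy = B.concat . BL.toChunks
```

Recent versions of the bytestring package also export Data.ByteString.Lazy.fromStrict and Data.ByteString.Lazy.toStrict, which do the same job directly.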
You may want to ask the maintainers of some packages if they'd be able to provide a text interface instead of, or in addition to, a bytestring interface.
Upvotes: 26