Reputation: 23091
In Elixir, if I have a string such as "José1 José2", how do I enumerate it? If I try to use Enum or for comprehensions, I get the following error:
** (Protocol.UndefinedError) protocol Enumerable not implemented for "José1 José2" of type BitString
Upvotes: 8
Views: 2769
Reputation: 23091
Strings in Elixir are UTF-8 encoded binaries. If you want to enumerate a binary, which is just a collection of bytes, you need to specify how.
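As a rough illustration of the point above, the sketch below measures the same string in three different units. It assumes the sample string from the question, written with \u escapes so that the two forms of "é" are unambiguous; the variable names are only illustrative.

```elixir
# Sample string from the question: the first "é" is U+00E9,
# the second is "e" followed by U+0301 (combining acute accent).
string = "Jos\u00E91 Jose\u03012"

# A string is a binary, and binaries have no Enumerable implementation.
binary? = is_binary(string)                 # true
impl = Enumerable.impl_for(string)          # nil

# The "size" depends on the unit you pick.
byte_count = byte_size(string)                        # 14
codepoint_count = length(String.codepoints(string))   # 12
grapheme_count = String.length(string)                # 11

IO.inspect({binary?, impl, byte_count, codepoint_count, grapheme_count})
```

Three different answers for one string is the reason Elixir makes you choose a unit before enumerating.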
String.graphemes/1 - this will give you a list of strings, where each string contains an individual Unicode grapheme. This is probably closest to what you mean if you want each "character".
iex> String.graphemes("José1 José2")
["J", "o", "s", "é", "1", " ", "J", "o", "s", "é", "2"]
String.codepoints/1 - this will give you a list of strings, one per Unicode codepoint. Note that a single codepoint does not necessarily correspond to one human-readable character.
iex> String.codepoints("José1 José2")
["J", "o", "s", "é", "1", " ", "J", "o", "s", "e", "́", "2"]
You can see that the first and the second é graphemes are represented differently in terms of Unicode codepoints. The first one is LATIN SMALL LETTER E WITH ACUTE (U+00E9), whereas the second one is LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
This is why you can't simply enumerate a string: when dealing with Unicode, you have to specify whether you are interested in graphemes, codepoints, or something else.
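To make the difference between the two é forms concrete, here is a small sketch comparing them directly. The variable names are illustrative; String.normalize/2 and String.equivalent?/2 are standard library functions.

```elixir
precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The two binaries differ byte for byte...
same_bytes? = precomposed == decomposed                          # false

# ...but each is a single grapheme.
grapheme_counts = {String.length(precomposed), String.length(decomposed)}   # {1, 1}

# Only the codepoint counts differ.
codepoint_counts =
  {length(String.codepoints(precomposed)), length(String.codepoints(decomposed))}   # {1, 2}

# Normalizing to NFC composes the pair into the single codepoint.
composed = String.normalize(decomposed, :nfc)
composed_matches? = composed == precomposed                      # true

# Canonical equivalence can also be checked directly.
equivalent? = String.equivalent?(precomposed, decomposed)        # true

IO.inspect({same_bytes?, grapheme_counts, codepoint_counts, composed_matches?, equivalent?})
```

If you need to compare or deduplicate user-visible text, normalizing first is usually what you want.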
String.to_charlist/1 - gives you a list of the numerical Unicode codepoints of the string. This can be used to interface with Erlang libraries that use this format.
iex> String.to_charlist("José1 José2")
[74, 111, 115, 233, 49, 32, 74, 111, 115, 101, 769, 50]
:binary.bin_to_list/1 - if you just want to enumerate the bytes.
iex> :binary.bin_to_list("José1 José2")
[74, 111, 115, 195, 169, 49, 32, 74, 111, 115, 101, 204, 129, 50]
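As a hedged sketch of how the two list forms above round-trip, the example below passes a charlist through an Erlang list function and converts it back, then does the same for the raw bytes. The use of :lists.reverse/1 is only an illustration of calling into Erlang, not something the original answer prescribes.

```elixir
# Charlist: a list of integer codepoints, as Erlang string functions expect.
charlist = String.to_charlist("Jos\u00E9")       # [74, 111, 115, 233]

# Any Erlang list function works on it; here we just reverse it.
reversed = :lists.reverse(charlist)              # [233, 115, 111, 74]

# List.to_string/1 encodes the codepoints back into a UTF-8 binary.
back = List.to_string(reversed)                  # "\u00E9soJ"

# Bytes round-trip through :binary.list_to_bin/1 instead.
bytes = :binary.bin_to_list("Jos\u00E9")         # [74, 111, 115, 195, 169]
rebuilt = :binary.list_to_bin(bytes)             # "Jos\u00E9"

IO.inspect({charlist, back, bytes, rebuilt})
```

Note that reversing the byte list instead of the charlist would split the two bytes of "é" and produce invalid UTF-8.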
Once you have a list, you can enumerate it using comprehensions or any of the functions in the Enum module:
iex> for c <- String.graphemes("José1 José2"), into: "", do: c <> c
"JJoosséé11 JJoosséé22"
iex> "José1 José2" |> String.graphemes() |> Enum.join("|")
"J|o|s|é|1| |J|o|s|é|2"
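A few more Enum calls on the grapheme list, as a sketch only. The string is the sample from the question written with escapes, and the variable names are illustrative.

```elixir
string = "Jos\u00E91 Jose\u03012"
graphemes = String.graphemes(string)

# Pair each grapheme with its position; take the first two pairs.
first_two = graphemes |> Enum.with_index() |> Enum.take(2)    # [{"J", 0}, {"o", 1}]

# Count graphemes that need more than one byte in UTF-8.
multi_byte = Enum.count(graphemes, fn g -> byte_size(g) > 1 end)   # 2

# Reversing grapheme by grapheme keeps "e" and its combining accent together.
reversed = graphemes |> Enum.reverse() |> Enum.join()
matches_builtin? = reversed == String.reverse(string)         # true

IO.inspect({first_two, multi_byte, matches_builtin?})
```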
It is also possible to use comprehensions with bitstring generators for enumerating the bytes and codepoints (but not the graphemes).
Equivalent to :binary.bin_to_list/1:
iex> for <<byte <- "José1 José2">>, do: byte
[74, 111, 115, 195, 169, 49, 32, 74, 111, 115, 101, 204, 129, 50]
Equivalent to String.to_charlist/1, by specifying that the segment type is utf8:
iex> for <<cp::utf8 <- "José1 José2">>, do: cp
[74, 111, 115, 233, 49, 32, 74, 111, 115, 101, 769, 50]
Equivalent to String.codepoints/1, by specifying that the segment type is utf8 and converting the resulting codepoints back to UTF-8 binaries:
iex> for <<cp::utf8 <- "José1 José2">>, do: <<cp::utf8>>
["J", "o", "s", "é", "1", " ", "J", "o", "s", "e", "́", "2"]
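Bitstring generators also accept filters, and although they cannot yield graphemes, String.next_grapheme/1 can be combined with Stream.unfold/2 to walk graphemes lazily. The following is a minimal sketch under those assumptions, not the only way to do it; variable names are illustrative.

```elixir
string = "Jos\u00E91 Jose\u03012"

# Filter inside a bitstring generator: keep only non-ASCII codepoints.
non_ascii = for <<cp::utf8 <- string>>, cp > 127, do: cp      # [233, 769]

# String.next_grapheme/1 returns {grapheme, rest} or nil, which is
# the shape Stream.unfold/2 expects from its function.
lazy_graphemes = Stream.unfold(string, &String.next_grapheme/1)

# Only as much of the string as needed is scanned.
first_four = Enum.take(lazy_graphemes, 4)                     # ["J", "o", "s", "\u00E9"]

# Consuming the whole stream gives the same result as String.graphemes/1.
same_as_graphemes? = Enum.to_list(lazy_graphemes) == String.graphemes(string)   # true

IO.inspect({non_ascii, first_four, same_as_graphemes?})
```

The lazy form can be useful for long strings where you only need the first few graphemes.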
P.S. For further reading about character encodings, this blog post from 2003 is great: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Upvotes: 14