Adam Millerchip

Reputation: 23091

Enumerate a string in Elixir

In Elixir, if I have a string such as "José1 José2", how do I enumerate it? If I try to use Enum or for comprehensions, I get the following error:

** (Protocol.UndefinedError) protocol Enumerable not implemented for "José1 José2" of type BitString

Upvotes: 8

Views: 2769

Answers (1)

Adam Millerchip

Reputation: 23091

Strings in Elixir are UTF-8 encoded binaries. If you want to enumerate a binary, which is just a collection of bytes, you need to specify how.
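To make the "binary" point concrete, here is a small sketch that compares the byte count with the grapheme count. It assumes the second "é" in the question's string is written as "e" plus a combining accent, so the string is spelled with escapes here to avoid depending on how the text was copied.

```elixir
# Assumption: the first "é" is precomposed (U+00E9), the second is
# "e" followed by COMBINING ACUTE ACCENT (U+0301).
s = "Jos\u00E91 Jose\u03012"

# A string is just a binary...
binary? = is_binary(s)

# ...so its size in bytes differs from its length in graphemes.
bytes = byte_size(s)          # 14
graphemes = String.length(s)  # 11

IO.inspect({binary?, bytes, graphemes})
```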

String.graphemes/1 - this will give you a list of strings, where each string contains an individual Unicode grapheme. This is probably closest to what you mean if you want each "character".

iex> String.graphemes("José1 José2")
["J", "o", "s", "é", "1", " ", "J", "o", "s", "é", "2"]

String.codepoints/1 - this will give you a list of strings broken down by Unicode codepoints. Note that a Unicode codepoint does not necessarily translate to a human-readable character.

iex> String.codepoints("José1 José2")
["J", "o", "s", "é", "1", " ", "J", "o", "s", "e", "́", "2"]

You can see that the first and the second é graphemes are represented differently in terms of Unicode codepoints. The first one is LATIN SMALL LETTER E WITH ACUTE (U+00E9), whereas the second one is LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).

This is why you can't simply enumerate a string: when dealing with Unicode, you have to specify whether you are interested in graphemes, codepoints, or something else.
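The difference between the two forms of é can be checked in isolation. The sketch below is only an illustration: the variable names are made up, and `String.equivalent?/2` and `String.normalize/2` are standard library functions not mentioned above.

```elixir
precomposed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The underlying bytes differ, so a plain comparison fails.
same_bytes = precomposed == decomposed

# Each form is one grapheme, but a different number of codepoints.
grapheme_counts = {String.length(precomposed), String.length(decomposed)}

codepoint_counts =
  {length(String.codepoints(precomposed)), length(String.codepoints(decomposed))}

# Canonical equivalence treats the two forms as the same text.
equivalent = String.equivalent?(precomposed, decomposed)
nfc_matches = String.normalize(decomposed, :nfc) == precomposed

IO.inspect({same_bytes, grapheme_counts, codepoint_counts, equivalent, nfc_matches})
```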

String.to_charlist/1 - gives you a list of the numerical Unicode codepoints of the string. This can be used to interface with Erlang libraries that use this format.

iex> String.to_charlist("José1 José2")
[74, 111, 115, 233, 49, 32, 74, 111, 115, 101, 769, 50]
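As a rough illustration of the Erlang interop, a charlist can be handed to Erlang's `:unicode` module and converted back to the original binary. The string is again written with escapes, assuming the second é is the decomposed form.

```elixir
s = "Jos\u00E91 Jose\u03012"

# One integer per Unicode codepoint.
charlist = String.to_charlist(s)
codepoint_count = length(charlist)

# Erlang's :unicode module accepts charlists and returns a UTF-8 binary.
round_trip = :unicode.characters_to_binary(charlist)

# The Elixir-side equivalent of the conversion back.
from_list = List.to_string(charlist)

IO.inspect({codepoint_count, round_trip == s, from_list == s})
```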

:binary.bin_to_list/1 - gives you a list of the raw bytes, if that is all you want to enumerate.

iex> :binary.bin_to_list("José1 José2")
[74, 111, 115, 195, 169, 49, 32, 74, 111, 115, 101, 204, 129, 50]

Once you have a list, you can enumerate it using comprehensions or any of the functions in the Enum module:

iex> for c <- String.graphemes("José1 José2"), into: "", do: c <> c
"JJoosséé11  JJoosséé22"

iex> "José1 José2" |> String.graphemes() |> Enum.join("|")
"J|o|s|é|1| |J|o|s|é|2"
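If building the whole list up front is undesirable, one possible approach is to walk the graphemes lazily. This is a sketch rather than a recommendation: it relies on `String.next_grapheme/1` returning `{grapheme, rest}` or `nil`, which happens to be the shape `Stream.unfold/2` expects.

```elixir
s = "Jos\u00E91 Jose\u03012"

# Stream.unfold/2 calls String.next_grapheme/1 repeatedly on the remainder
# of the string and stops when it returns nil (empty string).
first_four =
  s
  |> Stream.unfold(&String.next_grapheme/1)
  |> Enum.take(4)

IO.inspect(first_four)
```

Only as many graphemes as `Enum.take/2` asks for are extracted.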

It is also possible to use comprehensions with bitstring generators for enumerating the bytes and codepoints (but not the graphemes).

Equivalent to :binary.bin_to_list/1:

iex> for <<byte <- "José1 José2">>, do: byte
[74, 111, 115, 195, 169, 49, 32, 74, 111, 115, 101, 204, 129, 50]

Equivalent to String.to_charlist/1, by specifying that each segment of the binary is of type utf8:

iex> for <<cp::utf8 <- "José1 José2">>, do: cp
[74, 111, 115, 233, 49, 32, 74, 111, 115, 101, 769, 50]

Equivalent to String.codepoints/1, by specifying that each segment of the binary is of type utf8, and converting the resulting codepoints back to UTF-8 binaries:

iex> for <<cp::utf8 <- "José1 José2">>, do: <<cp::utf8>>
["J", "o", "s", "é", "1", " ", "J", "o", "s", "e", "́", "2"]
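Bitstring generators also accept filters and the `into:` option, like any other comprehension. As a hypothetical example, the sketch below keeps only the ASCII codepoints and collects them straight back into a string; the `cp < 128` condition is just an illustrative choice.

```elixir
s = "Jos\u00E91 Jose\u03012"

# Keep codepoints below 128 (ASCII) and rebuild a binary from them.
# Both the precomposed é and the combining accent are dropped.
ascii_only = for <<cp::utf8 <- s>>, cp < 128, into: "", do: <<cp::utf8>>

IO.inspect(ascii_only)
```

Note that the plain "e" of the decomposed é survives, because the filter works on codepoints, not graphemes.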

P.S. For further reading about character encodings, this blog post from 2003 is great: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Upvotes: 14
