Reputation: 5462
I came across a function like this:
@spec split_words(String.t) :: [String.t]
defp split_words(text) do
  Regex.scan ~r/(*UTF)[\p{L}0-9-]+/i, text
end
It exists so that the following test passes:
test "German" do
  expected = %{"götterfunken" => 1, "schöner" => 1, "freude" => 1}
  assert Words.count("Freude schöner Götterfunken") == expected
end
What is (*UTF)? Is that Elixir-specific or a general regex concept? I'm guessing it's to "cast" the string to UTF encoding.
And what about \p{L}? Is this an "expander" of some kind that tells the engine to use an alphabet that includes the umlaut characters?
I saw it in this repository: https://github.com/alxndr/exercism/blob/master/elixir/word-count/word_count.exs#L25
Upvotes: 1
Views: 619
Reputation: 89574
No, (*UTF) is not Elixir-specific. It tells the PCRE regex engine (the one Elixir uses, through Erlang's re module) to read the target string as a UTF-8 encoded string; otherwise the string is read one byte at a time. It doesn't cast or convert the target string.
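To make the byte-versus-character difference concrete, here is a minimal sketch. It assumes Elixir's `u` regex modifier as the way to turn on Unicode mode, rather than the in-pattern (*UTF) verb from the question; the variable name `word` is only illustrative.

```elixir
word = "schöner"

# "ö" is a single character, but it takes two bytes in UTF-8.
chars = String.length(word)
bytes = byte_size(word)
IO.inspect({chars, bytes})
# {7, 8}

# Without Unicode mode, `.` consumes one byte, so the string is 8 units long.
byte_mode = Regex.match?(~r/^.{8}$/, word)

# With the `u` modifier, `.` consumes one code point, so it is 7 units long.
unicode_mode = Regex.match?(~r/^.{7}$/u, word)

IO.inspect({byte_mode, unicode_mode})
# {true, true}
```

The string itself is the same in both cases; only the way the engine steps through it changes.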
\p{L} is a Unicode character class (a Unicode property) that matches any letter, in any alphabet, with or without accents.
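As a rough illustration of \p{L} on the sentence from the question's test, the sketch below again assumes the `u` modifier instead of (*UTF). The flattening, downcasing and counting steps are my own additions for the demo, not necessarily what the linked repository does. Note that Regex.scan returns a list of lists, one inner list per match.

```elixir
text = "Freude schöner Götterfunken"

# \p{L} matches any Unicode letter, so "ö" stays inside its word.
matches = Regex.scan(~r/[\p{L}0-9-]+/u, text)
IO.inspect(matches)
# [["Freude"], ["schöner"], ["Götterfunken"]]

# Flatten the per-match lists and normalise case before counting.
words =
  matches
  |> List.flatten()
  |> Enum.map(&String.downcase/1)

counts = Enum.frequencies(words)
IO.inspect(counts)
```

Enum.frequencies requires Elixir 1.10 or later; on older versions an Enum.reduce over the words with Map.update would do the same job.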
More information here: http://pcre.org/original/pcre.txt
Upvotes: 4