Nona

Reputation: 5462

Why do you need these characters in the regex in this Elixir regex match?

I came across a method such as:

  @spec split_words(String.t) :: [String.t]
  defp split_words(text) do
    Regex.scan ~r/(*UTF)[\p{L}0-9-]+/i, text
  end

Its purpose is to make the following test pass:

  test "German" do
    expected = %{"götterfunken" => 1, "schöner" => 1, "freude" => 1}
    assert Words.count("Freude schöner Götterfunken") == expected
  end

What is (*UTF) - is that Elixir specific or a regex concept? I'm guessing it's to "cast" the string to UTF encoding. And what about \p{L} - is this an "expander" of some kind that tells the engine to use an alphabet that includes umlauted characters?

I saw it in this repository: https://github.com/alxndr/exercism/blob/master/elixir/word-count/word_count.exs#L25

Upvotes: 1

Views: 619

Answers (1)

Casimir et Hippolyte

Reputation: 89574

No, (*UTF) is a PCRE feature, not something Elixir specific. It tells the PCRE regex engine (the one Elixir uses through Erlang's re module) to read the target string as a UTF-8 encoded string; otherwise the string is read one byte at a time. It doesn't cast or convert the target string.

\p{L} is a Unicode character class that matches any letter (in any alphabet, with or without accents).
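To make this concrete, here is a minimal sketch that runs the pattern from the question against the German test sentence. It also uses Elixir's `u` modifier, which is the documented way to enable Unicode mode; note that `u` is not strictly identical to (*UTF), since it additionally makes classes such as \w Unicode-aware. The variable names are only illustrative, and the counting step is an assumption about what `Words.count` does, not the repository's actual code.

```elixir
text = "Freude schöner Götterfunken"

# Unicode mode via Elixir's `u` modifier: the subject is read as UTF-8,
# so \p{L} sees "ö" as one letter instead of two separate bytes.
with_modifier = Regex.scan(~r/[\p{L}0-9-]+/u, text)

# The same character class with the inline (*UTF) verb from the question.
with_verb = Regex.scan(~r/(*UTF)[\p{L}0-9-]+/, text)

# Regex.scan/2 returns one list per match, so flatten before counting.
# (Hypothetical counting step, assumed from the test in the question.)
counts =
  with_modifier
  |> List.flatten()
  |> Enum.map(&String.downcase/1)
  |> Enum.frequencies()

IO.inspect(with_modifier)
IO.inspect(with_verb)
IO.inspect(counts)
```

Without either (*UTF) or the `u` modifier, the engine works byte by byte, so a two-byte character such as "ö" can be split in the middle and the words come back broken.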

More information here: http://pcre.org/original/pcre.txt

Upvotes: 4
