Reputation: 172

How do I use regex within a String.split without splitting on international characters?

I am working on an Elixir problem where I have a phrase:

phrase = "duck duck goose more_ducks hyphenated-duck überduck"

I am attempting to split this string into a list of words where underscored words are separate words and hyphenated words are not. The following code works for me:

String.split(phrase, ~r{([^\w'-]+|_)})

with the exception of the umlaut character, on which it splits the word. I would like it not to split on international characters, but I can't seem to find a way that works.

I have tried several permutations of ^p{Ll}$/u with the latest being:

String.split(~r{[^\w'-]+/^\p{L}/u|_})

I haven't been able to find out the purpose of the '$' before /u in my reading either, but it shows up in a lot of examples. I seem to get some sort of error no matter where I place it in the regex.

Any insight or help would be very appreciated. I feel I am missing something basic.

Thank you in advance

UPDATE: One of the insights in the comments gave me a solution and explanation to my problem. The "u" is modifying the ~r{} sigil. When I put the "u" in the correct place, it worked fine:

String.split(phrase, ~r{([^\w'-]+|_)}u)
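
As a minimal sketch of the fix described in the update, the pattern can be wrapped in a small helper. The module name `Words` and the function `split_words/1` are hypothetical names chosen for illustration; only the regex and the phrase come from the question.

```elixir
defmodule Words do
  # Hypothetical helper, not part of the original question.
  # Splits on runs of characters that are not word characters,
  # apostrophes, or hyphens, and also on single underscores.
  # The trailing `u` is a modifier on the ~r sigil itself: it turns on
  # Unicode handling, so \w also covers letters such as "ü".
  def split_words(phrase) do
    String.split(phrase, ~r{([^\w'-]+|_)}u)
  end
end

phrase = "duck duck goose more_ducks hyphenated-duck überduck"
IO.inspect(Words.split_words(phrase))
```

With the modifier in place, "überduck" should come back as a single word. In examples written as /^...$/u, the `^` and `$` are ordinary start and end anchors inside the pattern, and the `u` is a flag that follows the closing delimiter, which is why it has to sit after the closing `}` of `~r{}` rather than inside it.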

Upvotes: 0

Answers (3)

Adam DePrince

Reputation: 1

Regular expressions can be somewhat hard to read. Much easier would be to use the xr wrapper for Python's regular expressions.

% pip install xr
% python
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
...
>>> from xr import Text
>>> Text(' ').split('a b c')
['a', 'b', 'c']

xr also provides a bit of syntactic sugar for this use case:

WhiteSpace.split('a b c d')

Anyhow, you might be interested to know that I just added your duck duck goose example string to xr's unit tests.

>>> WhiteSpace.split("duck duck goose more_ducks hyphenated-duck überduck")
['duck', 'duck', 'goose', 'more_ducks', 'hyphenated-duck', 'überduck']

Upvotes: -2

Adam Millerchip

Reputation: 23111

You don't need to use Regex at all:

String.split(phrase, [" ", "_"])

Output:

["duck", "duck", "goose", "more", "ducks", "hyphenated-duck", "überduck"]
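
One thing to watch with a plain list of delimiters is that repeated or trailing delimiters produce empty strings. The sketch below assumes that is unwanted and passes the `trim: true` option of `String.split/3`; the module name `PlainSplit` and the sample input are made up for illustration.

```elixir
defmodule PlainSplit do
  # Hypothetical helper, for illustration only.
  # Splits on literal spaces and underscores, no regex involved.
  # `trim: true` removes the empty strings that appear when two
  # delimiters are adjacent or when the phrase ends with a delimiter.
  def words(phrase) do
    String.split(phrase, [" ", "_"], trim: true)
  end
end

# Two spaces in a row and a trailing space would otherwise yield "" entries.
IO.inspect(PlainSplit.words("duck  duck_goose "))
```

Because the delimiters are matched literally, this approach does not depend on any Unicode settings, but it also does not treat punctuation as a separator.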

Upvotes: 1

Aleksei Matiushkin

Reputation: 121000

Use Regex.scan/3, which is more natural here. You need to explicitly make the regex Unicode-aware (the u modifier on the ~r// sigil) and match runs of consecutive letters, apostrophes, and/or dashes.

Regex.scan ~r/[\p{L}'’-]+/u, phrase
#⇒ [
#    ["duck"],
#    ["duck"],
#    ["goose"],
#    ["more"],
#    ["ducks"],
#    ["hyphenated-duck"],
#    ["überduck"]
#  ]
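
Regex.scan returns one inner list per match, as the output above shows. If a flat list of words is wanted instead, one option is to pipe the result through `List.flatten/1`. This is a sketch only; the module name `ScanWords` is hypothetical.

```elixir
defmodule ScanWords do
  # Hypothetical helper, for illustration only.
  # Collects every run of letters, apostrophes, and hyphens, then
  # flattens [["duck"], ["duck"], ...] into ["duck", "duck", ...].
  def words(phrase) do
    ~r/[\p{L}'’-]+/u
    |> Regex.scan(phrase)
    |> List.flatten()
  end
end

phrase = "duck duck goose more_ducks hyphenated-duck überduck"
IO.inspect(ScanWords.words(phrase))
```

Note that `\p{L}` matches letters only, so digits and underscores act as separators with this pattern.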

Sidenote:

German character

Umlaut is by no means “a German character”; it’s a so-called combining diacritical mark named diaeresis that is used in many languages beyond German. See the English word naïve or the French company Citroën, for example.
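
The combining-mark point has a practical consequence for the regex. The same visible "ü" can be stored as one precomposed code point (U+00FC) or as "u" followed by the combining diaeresis (U+0308), and `\p{L}` does not match the combining mark on its own. The sketch below assumes the input might arrive in the decomposed (NFD) form and adds `\p{M}` to the character class to cover it; the module and function names are hypothetical.

```elixir
defmodule Diaeresis do
  # Hypothetical helpers, for illustration only.

  # Letters, apostrophes, and hyphens, as in the answer above.
  def letters_only(text), do: scan(~r/[\p{L}'’-]+/u, text)

  # Same pattern, but also accepting combining marks (\p{M}).
  def with_marks(text), do: scan(~r/[\p{L}\p{M}'’-]+/u, text)

  defp scan(regex, text), do: regex |> Regex.scan(text) |> List.flatten()
end

composed = "\u00FCberduck"                    # "ü" as a single code point
decomposed = String.normalize(composed, :nfd) # "u" followed by U+0308

IO.inspect(String.to_charlist(decomposed) |> Enum.take(2)) # code points of "u" and the mark
IO.inspect(Diaeresis.letters_only(composed))   # one word
IO.inspect(Diaeresis.letters_only(decomposed)) # the word breaks at the mark
IO.inspect(Diaeresis.with_marks(decomposed))   # one word again
```

An alternative, under the same assumption, would be to normalize the input to NFC with `String.normalize/2` before scanning, so that the letter-only pattern sees precomposed characters where they exist.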

Upvotes: 2
