Reputation: 172
I am working on an Elixir problem where I have a phrase:
phrase = "duck duck goose more_ducks hyphenated-duck überduck"
I am attempting to split this string into a list of words where underscored words are separate words and hyphenated words are not. The following code works for me:
String.split(phrase, ~r{([^\w'-]+|_)})
with the exception of the umlaut character, on which it splits the word. I would like it not to split on international characters, but I can't seem to find a way that works.
I have tried several permutations of ^p{Ll}$/u
with the latest being:
String.split(~r{[^\w'-]+/^\p{L}/u|_})
I haven't been able to find out the purpose of the '$' before /u in my reading either, but it shows up in a lot of examples. I seem to get some sort of error no matter where I place it in the regex section.
Any insight or help would be very appreciated. I feel I am missing something basic.
Thank you in advance
UPDATE: One of the insights in the comments gave me a solution and explanation to my problem. The "u" is modifying the ~r{} sigil. When I put the "u" in the correct place, it worked fine:
String.split(phrase, ~r{([^\w'-]+|_)}u)
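To illustrate the point about the modifier, here is a minimal sketch comparing the same pattern with and without the trailing u. It assumes the phrase from the question; the variable names ascii_words and unicode_words are made up for this example, and the capture group is dropped because String.split/2 does not need it.

```elixir
phrase = "duck duck goose more_ducks hyphenated-duck überduck"

# Without the u modifier, \w only covers ASCII word characters,
# so the "ü" falls into the negated class and is treated as a separator.
ascii_words = String.split(phrase, ~r{[^\w'-]+|_})

# With the u modifier placed after the closing delimiter of the sigil,
# \w also matches Unicode letters, so "überduck" stays in one piece.
unicode_words = String.split(phrase, ~r{[^\w'-]+|_}u)

IO.inspect(ascii_words, label: "without u")
IO.inspect(unicode_words, label: "with u")
```

In the /pattern$/u notation seen in many examples, the slashes are the regex delimiters, $ is the ordinary end-of-line anchor, and u is the flag that follows the closing delimiter. In Elixir the sigil delimiters play the role of the slashes, which is why the u goes after the closing brace rather than inside the pattern.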
Upvotes: 0
Views: 359
Reputation: 1
Regular expressions can be somewhat hard to read. Much easier would be to use the xr wrapper for Python's regular expressions.
% pip install xr
% python
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
...
>>> from xr import Text
>>> Text(' ').split('a b c')
['a', 'b', 'c']
xr also provides a bit of syntactic sugar for this use case:
WhiteSpace.split('a b c d')
Anyhow, you might be interested to know that I just added your duck duck goose example string to xr's unit tests.
>>> WhiteSpace.split("duck duck goose more_ducks hyphenated-duck überduck")
['duck', 'duck', 'goose', 'more_ducks', 'hyphenated-duck', 'überduck']
Upvotes: -2
Reputation: 23111
You don't need to use Regex at all:
String.split(phrase, [" ", "_"])
Output:
["duck", "duck", "goose", "more", "ducks", "hyphenated-duck", "überduck"]
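One caveat with the plain-pattern approach, as a small sketch: String.split/3 with a list of patterns splits on every single occurrence, so repeated separators produce empty strings unless the trim: true option is passed. The input string messy below is a made-up example, not taken from the question.

```elixir
# Hypothetical input with a doubled space and a doubled underscore.
messy = "duck  duck__goose"

# Each separator occurrence produces a split, leaving empty strings behind.
raw = String.split(messy, [" ", "_"])

# trim: true removes the empty strings from the result.
trimmed = String.split(messy, [" ", "_"], trim: true)

IO.inspect(raw, label: "raw")
IO.inspect(trimmed, label: "trimmed")
```

This approach also leaves punctuation such as commas attached to the neighbouring word, so it fits best when the input is known to contain only spaces and underscores as separators.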
Upvotes: 1
Reputation: 121000
Use Regex.scan/3, which is more natural here. You need to explicitly set the regex to Unicode mode (the u modifier on the ~r// sigil) and match runs of consecutive letters, apostrophes, and/or dashes.
Regex.scan ~r/[\p{L}'’-]+/u, phrase
#⇒ [
# ["duck"],
# ["duck"],
# ["goose"],
# ["more"],
# ["ducks"],
# ["hyphenated-duck"],
# ["überduck"]
# ]
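Since Regex.scan/3 returns one list per match, a flat list of words needs one more step. A minimal sketch, assuming the phrase from the question (the variable name words is only for illustration):

```elixir
phrase = "duck duck goose more_ducks hyphenated-duck überduck"

words =
  ~r/[\p{L}'’-]+/u
  # Each match comes back wrapped in its own single-element list.
  |> Regex.scan(phrase)
  # Flatten the nested lists into a plain list of strings.
  |> List.flatten()

IO.inspect(words)
```

Because \p{L} matches letters only, the underscore in more_ducks is never part of a match, which is what separates "more" from "ducks" without naming the underscore in the pattern.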
Sidenote:
German character
An umlaut is by no means “a German character”; it is a so-called combining diacritical mark, named diaeresis, that is used in many languages beyond German. See the English word naïve or the French company Citroën, for example.
Upvotes: 2