Reputation: 165
My problem can be seen in this example: https://regex101.com/r/dToBvm/1/
I am trying to match all words using this regex: \b([äöüÄÖÜß\w]+)\b
The funny thing is that "säs" gets matched, but not "äss" or "sää". If a word starts or ends with an umlaut, it won't match.
How do I solve this problem?
Upvotes: 2
Views: 927
Reputation: 4928
A more general solution, based on https://stackoverflow.com/a/56945933/1029371, using Unicode property escapes:
console.log('asdöö.ÄÄ-asdas'.split(/(?<!\p{Letter})(\p{Letter}+)(?!\p{Letter})/u))
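If you only want the words themselves rather than the split pieces, the same pattern can probably be used with match() and the g flag. This is a minimal sketch: the sample string and the variable names are my own assumptions, and it needs an engine with lookbehind and \p{...} support (recent Node or browsers).

```javascript
// Illustrative sketch; variable names and the sample string are assumptions.
const input = 'asdöö.ÄÄ-asdas äss sää';

// Same lookarounds as in the split() call, with the g flag so that
// match() returns every word instead of only the first one.
const wordPattern = /(?<!\p{Letter})\p{Letter}+(?!\p{Letter})/gu;
const found = input.match(wordPattern);

console.log(found); // ['asdöö', 'ÄÄ', 'asdas', 'äss', 'sää']
```

Note that \p{Letter} covers letters only, so unlike \w it does not match digits or underscores.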
Upvotes: 0
Reputation: 37377
Because \b treats ä as a non-word character (\w does not cover it), a word boundary is matched between ä and s in äss and between s and ä in sää (that's how \b is defined).
You need to use negative lookarounds to achieve what you want:
(?<![äöüÄÖÜß\w])([äöüÄÖÜß\w]+)(?![äöüÄÖÜß\w])
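To see the difference, here is a small sketch that runs both patterns against the three words from the question. It assumes JavaScript, where \w and \b are ASCII-only, and an engine that supports lookbehind. The variable names are made up for the example.

```javascript
// Sample words taken from the question.
const words = ['säs', 'äss', 'sää'];

// Original pattern: \b only knows ASCII word characters in JavaScript,
// so it finds boundaries next to the umlauts, inside the words.
const withBoundary = /\b([äöüÄÖÜß\w]+)\b/g;

// Lookaround version: "no word character or umlaut directly before or after".
const withLookaround = /(?<![äöüÄÖÜß\w])([äöüÄÖÜß\w]+)(?![äöüÄÖÜß\w])/g;

const boundaryMatches = words.map((w) => w.match(withBoundary));
const lookaroundMatches = words.map((w) => w.match(withLookaround));

console.log(boundaryMatches);   // [['säs'], ['ss'], ['s']]
console.log(lookaroundMatches); // [['säs'], ['äss'], ['sää']]
```

With \b only the ASCII part next to the umlaut is matched, while the lookaround version returns the whole word.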
Upvotes: 1
Reputation: 27723
I think your expression is good; maybe we could slightly modify it to:
(?<=^|\s)([\p{L}\p{N}]{3})(?=[\s.,]+|$)
The expression is explained in the top right panel of this demo, if you wish to explore further or modify it, and in this link you can watch how it matches some sample inputs step by step, if you like.
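As a rough sketch of how this could be used in JavaScript (an assumption on my side, since the flavor is not stated; the u flag is needed for \p{L} and \p{N}): the {3} quantifier only matches words of exactly three characters, so the second pattern below swaps it for + to allow any length. The sample strings and variable names are hypothetical.

```javascript
// Pattern from the answer: exactly three letters/digits, delimited by
// whitespace or start of string before, and whitespace, '.', ',' or end after.
const threeChars = /(?<=^|\s)([\p{L}\p{N}]{3})(?=[\s.,]+|$)/gu;

// Assumed variation: + instead of {3}, for words of any length.
const anyLength = /(?<=^|\s)([\p{L}\p{N}]+)(?=[\s.,]+|$)/gu;

const shortWords = 'säs äss sää'.match(threeChars);
const sentenceWords = 'Ärger über süße Äpfel.'.match(anyLength);

console.log(shortWords);    // ['säs', 'äss', 'sää']
console.log(sentenceWords); // ['Ärger', 'über', 'süße', 'Äpfel']
```

Because the lookbehind only accepts whitespace or the start of the string, a word right after a hyphen or an opening bracket would not be matched by either pattern.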
Upvotes: 0