Reputation: 165

How to recognize German umlauts at a word boundary?

My problem can be seen in this example: https://regex101.com/r/dToBvm/1/

I am trying to match all words using this regex: \b([äöüÄÖÜß\w]+)\b

Oddly, "säs" gets matched, but "äss" and "sää" do not. If a word starts or ends with an umlaut, it won't match.

How do I solve this problem?

Upvotes: 2

Answers (3)

Stephan Hoyer

Reputation: 4928

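A more general solution, based on https://stackoverflow.com/a/56945933/1029371, uses Unicode property escapes (which require the u flag), so any letter counts as part of a word:

// The capturing group keeps the words in the result: ['', 'asdöö', '.', 'ÄÄ', '-', 'asdas', '']
console.log('asdöö.ÄÄ-asdas'.split(/(?<!\p{Letter})(\p{Letter}+)(?!\p{Letter})/u))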

Upvotes: 0

Michał Turczyn

Reputation: 37377

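Because \b matches between a \w character and a non-\w character, and the umlauts are not part of \w here. A word boundary is therefore matched between ä and s in äss, and between s and ä in sää (that's how \b is defined).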
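To see the effect concretely, here is a minimal sketch, assuming the JavaScript flavor (where \w covers only ASCII letters, digits, and underscore) and precomposed umlaut characters. The name matchWithWordBoundary is a hypothetical helper, not something from the question:

```javascript
// Assumption: JavaScript regex flavor, where \w means [A-Za-z0-9_] only,
// so "ä" counts as a non-word character for the purposes of \b.
const asciiBoundaryPattern = /\b([äöüÄÖÜß\w]+)\b/g;

// Hypothetical helper: returns every match of the pattern from the question.
function matchWithWordBoundary(text) {
  return text.match(asciiBoundaryPattern) || [];
}

console.log(matchWithWordBoundary('säs')); // [ 'säs' ]
console.log(matchWithWordBoundary('äss')); // [ 'ss' ]  (leading ä is cut off)
console.log(matchWithWordBoundary('sää')); // [ 's' ]   (trailing ää is cut off)
```

The words are not skipped entirely; only the part that sits between two valid \b positions is matched.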

You need to use negative lookarounds to achieve what you want:

(?<![äöüÄÖÜß\w])([äöüÄÖÜß\w]+)(?![äöüÄÖÜß\w])
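As a minimal sketch of how this pattern could be used, assuming JavaScript with lookbehind support (ES2018 or later); extractGermanWords is a hypothetical helper name:

```javascript
// The lookarounds replace \b: a word may not be preceded or followed
// by another character from the same class.
const germanWordPattern = /(?<![äöüÄÖÜß\w])([äöüÄÖÜß\w]+)(?![äöüÄÖÜß\w])/g;

// Hypothetical helper: returns all words, including those that start
// or end with an umlaut.
function extractGermanWords(text) {
  return text.match(germanWordPattern) || [];
}

console.log(extractGermanWords('säs äss sää')); // [ 'säs', 'äss', 'sää' ]
console.log(extractGermanWords('Grüße, Öl!'));  // [ 'Grüße', 'Öl' ]
```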


Upvotes: 1

Emma

Reputation: 27723

I think your expression is good; maybe we could slightly modify it to:

(?<=^|\s)([\p{L}\p{N}]{3})(?=[\s.,]+|$)
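A sketch of this pattern in JavaScript, assuming the u flag (needed there for \p{L} and \p{N}) and lookbehind support; findThreeCharWords is a hypothetical name. Note that {3} matches exactly three characters, so longer words are skipped:

```javascript
// \p{L} = any Unicode letter, \p{N} = any Unicode number; both need the u flag.
const threeCharWordPattern = /(?<=^|\s)([\p{L}\p{N}]{3})(?=[\s.,]+|$)/gu;

// Hypothetical helper: returns the three-character words that are preceded by
// the start of the string or whitespace, and followed by whitespace, a period,
// a comma, or the end of the string.
function findThreeCharWords(text) {
  return text.match(threeCharWordPattern) || [];
}

console.log(findThreeCharWords('säs äss sää')); // [ 'säs', 'äss', 'sää' ]
console.log(findThreeCharWords('äss, Straße')); // [ 'äss' ]  (Straße is longer than 3)
```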

The expression is explained on the top right panel of this demo, if you wish to explore further or modify it. In this link, you can watch step by step how it would match against some sample inputs.

Upvotes: 0
