Reputation: 1368
I've been trying to come up with a regex that will replace a word that may or may not contain accented characters. I've been researching this for the past couple of days, but cannot find the information I need to solve my problem.
I had come up with a simple regex that handles words without accent characters great:
var re = new RegExp('(?:\\b)hello(?:\\b)', 'gm');
var string = 'hello hello hello world hellos hello';
string.replace(re, "FOO");
Result: FOO FOO FOO world hellos FOO
The above works as I want. The problem arises when the word has an accented character as its first or last character. Example:
var re = new RegExp('(?:\\b)helló(?:\\b)', 'gm');
var string = 'helló helló helló world hellós helló';
string.replace(re, "FOO");
Result: helló helló helló world FOOs helló
Desired result: FOO FOO FOO world hellós FOO
From my understanding, the above is occurring because an accented character is interpreted as a boundary. My attempt at solving the problem (note: the range [A-zÀ-ÿ] is what I consider the valid alphabet to construct a word):
var re = new RegExp('([^A-zÀ-ÿ]|^)helló([^A-zÀ-ÿ]|$)', 'gm');
var string = 'helló helló helló world hellós helló';
string.replace(re, "$1FOO$2");
Result: FOO helló FOO world hellós FOO
As you can see, I'm much closer to the desired result. However, the problem occurs when the word in question appears three or more times in a row. Note that the second occurrence of helló was ignored. I believe that's because the whitespace preceding it was already consumed by the match for the first occurrence of helló.
Does anybody have any suggestions on how to achieve FOO FOO FOO world hellós FOO?
Upvotes: 0
Views: 1797
Reputation: 51
The answer is a little complex, but the reason you are struggling with this has already been explained in the following question: Why can't I use accented characters next to a word boundary?
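As a quick illustration of that explanation (a minimal sketch, not taken from the linked question): in JavaScript, \w only covers [A-Za-z0-9_], so "ó" counts as a non-word character, and \b only matches between a word character and a non-word character. The variable names below are just for illustration, and the accented letter is written as \u00F3 so the snippet does not depend on the file encoding.

```javascript
// "hell\u00F3" is "helló" with a precomposed o-acute (U+00F3).
var word = 'hell\u00F3';

// \w is ASCII-only, so the accented letter is a non-word character.
var plainIsWordChar = /\w/.test('o');            // true
var accentedIsWordChar = /\w/.test('\u00F3');    // false

// "helló world": ó and the space are both non-word characters,
// so there is no boundary between them and \b fails.
var boundaryBeforeSpace = new RegExp(word + '\\b').test(word + ' world');  // false

// "hellós": ó is non-word and s is a word character,
// so \b matches there, which is why the question got "FOOs".
var boundaryBeforeLetter = new RegExp(word + '\\b').test(word + 's');      // true
```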
Given the lack of good Unicode support in JavaScript, especially before ECMAScript 6 (I've had this issue myself in the past), I have found that it is often better to use a third-party library with better Unicode support, such as: http://xregexp.com/
This also eliminates some of the variances in support from older browsers.
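If adding a library is not an option, one possible workaround is to keep the custom character class from the question but turn the trailing group into a lookahead, so the separator after one match is not consumed and remains available as the leading separator of the next match. The sketch below assumes the question's alphabet, narrowed from A-z to A-Za-z because A-z also includes [ \ ] ^ _ and the backtick; replaceWholeWord is a hypothetical helper name, not part of any library.

```javascript
// Characters treated as letters. This is an assumption carried over from
// the question; note that the range \u00C0-\u00FF also contains × and ÷.
var LETTER = 'A-Za-z\u00C0-\u00FF';

// Hypothetical helper: replace `word` only where it is not adjacent
// to another letter from LETTER.
function replaceWholeWord(text, word, replacement) {
  // Escape regex metacharacters so the word is matched literally.
  var escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var re = new RegExp(
    '(^|[^' + LETTER + '])' +       // leading separator, captured and put back
    escaped +
    '(?=[^' + LETTER + ']|$)',      // trailing separator, checked but not consumed
    'gm'
  );
  return text.replace(re, function (match, before) {
    return before + replacement;
  });
}

// "hell\u00F3" is "helló" with a precomposed o-acute (U+00F3).
var input = 'hell\u00F3 hell\u00F3 hell\u00F3 world hell\u00F3s hell\u00F3';
var output = replaceWholeWord(input, 'hell\u00F3', 'FOO');
// output is 'FOO FOO FOO world hellós FOO'
```

On engines that support ES2018 lookbehind and Unicode property escapes, a pattern along the lines of (?<!\p{L})helló(?!\p{L}) with the u flag should give the same result without a hand-written alphabet, but that will not run in older browsers.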
Upvotes: 2