JavaScript RegExp misbehaving

Question

My goal is to highlight feminine nouns (German) by wrapping them into a tag with a specific class="..." style.

As I'm dealing with a non-ASCII set I (unfortunately) cannot use the "word boundary" \b in JavaScript's RegEx so I'm forced to improvise by explicitly listing what I consider to be a word boundary.
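A minimal sketch of that limitation, using words from this post: \b is defined in terms of \w, which only covers [A-Za-z0-9_], so letters such as Ä, ü and ß are treated like punctuation. The variable names are illustrative only.

```javascript
// \b sits between a \w character ([A-Za-z0-9_]) and anything else,
// so non-ASCII letters are treated as non-word characters.
const startsWithUmlaut = /\bÄrztin\b/.test("Ärztin");
const splitsInsideWord = /\bGr\b/.test("Grüße");
const asciiWord = /\bLiebe\b/.test("Liebe Grüße");

console.log(startsWithUmlaut); // false: there is no boundary before "Ä"
console.log(splitsInsideWord); // true: "ü" ends the "word" right after "Gr"
console.log(asciiWord);        // true: plain ASCII words behave as usual
```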

My code (simplified and streamlined) looks like the following:

const wordBoundary = "(^|\\s|$|/|\\?|\\.|\\!|\\ )";
"Liebe Grüße".replace(
    new RegExp(`${wordBoundary}(Liebe|Grüße)${wordBoundary}`, "g"),
    `$1$2$3`
);

However, this only highlights the first word, and not the second, producing

Liebe Grüße.

While debugging in the console I found out (pretty much by accident) that if I use a regex initializer instead of a RegExp object, everything works as expected, producing

Liebe Grüße:

"Liebe Grüße".replace(
    /(^|\\s|$|\/|\\?|\\.|\\!|\\ )(Liebe|Grüße)(^|\\s|$|\/|\\?|\\.|\\!|\\ )/g,
    `$1$2$3`
);

My question is two-fold:

1. Am I doing something wrong by creating a RegExp object instead of using an in-place regex initializer? Because that looks like a bug to me, TBH.
2. If I'm forced to use the regex initializer, how do I provide that custom wordBoundary for it?

skirtle · Accepted Answer

First let's consider your word boundary:

const wordBoundary = "(^|\\s|$|/|\\?|\\.|\\!|\\ )";

Contrary to what has been asserted elsewhere, this is correctly escaped. It isn't necessarily the best way to write it, but it will work. The |\\ ) for the space at the end isn't necessary, as a space is already covered by the \\s. You also don't need to escape the !, but it won't hurt.
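One way to check the escaping is to look at what the string actually contains before it reaches the RegExp constructor. This is only a small sketch, and the variable names are made up for the illustration:

```javascript
// In a string literal, "\\s" is two characters: a backslash and "s".
// That is what the RegExp constructor needs to see in order to build \s.
const escaped = "\\s";
const escapedMatchesSpace = new RegExp(escaped).test(" ");
console.log(escaped.length);      // 2
console.log(escapedMatchesSpace); // true: the pattern is \s

// With a single backslash the string literal drops it, leaving just "s".
const unescaped = "\s";
const unescapedMatchesSpace = new RegExp(unescaped).test(" ");
console.log(unescaped.length);      // 1
console.log(unescapedMatchesSpace); // false: the pattern is a literal "s"

// \s already covers the plain space, so a separate "\\ " alternative is redundant.
const whitespaceCoversSpace = /\s/.test(" ");
console.log(whitespaceCoversSpace); // true
```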

Let's consider a similar example that just uses ASCII:

const wordBoundary = "(^|\\s|$|/|\\?|\\.|\\!|\\ )";

console.log(
    "cat dog".match(new RegExp(`${wordBoundary}(cat|dog)${wordBoundary}`, 'g'))
);

Notice that it only matches cat and not dog. Or to be more precise, it matches 'cat ', with a space at the end. This is the key. The space has already been matched so you can't match it again when attempting to match dog. Matches cannot overlap. To avoid this problem you'd use a positive lookahead to ensure the space isn't consumed:

const wordBoundary = "(^|\\s|$|/|\\?|\\.|\\!|\\ )";

console.log(
    "cat dog".match(new RegExp(`${wordBoundary}(cat|dog)(?=${wordBoundary})`, 'g'))
);

Better, now it's matching both cat and dog. Notice how the space is now at the start of ' dog' because it is part of the second match and not part of the first.
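To make the difference between the two versions easier to see, here is a sketch that prints each match together with its start index. The helper describeMatches is hypothetical and exists only for this comparison.

```javascript
const boundary = "(^|\\s|$|/|\\?|\\.|\\!|\\ )";

// Collect [matched text, start index] pairs for every match of a pattern.
function describeMatches(pattern, text) {
    return [...text.matchAll(new RegExp(pattern, "g"))].map(m => [m[0], m.index]);
}

// Trailing boundary is consumed: "cat " swallows the space that "dog" needs.
const consuming = describeMatches(`${boundary}(cat|dog)${boundary}`, "cat dog");
console.log(consuming); // [ [ 'cat ', 0 ] ]

// Trailing boundary is only inspected: the space stays available for " dog".
const lookahead = describeMatches(`${boundary}(cat|dog)(?=${boundary})`, "cat dog");
console.log(lookahead); // [ [ 'cat', 0 ], [ ' dog', 3 ] ]
```

The second match starts at index 3, the position of the space, which is only possible because the first match stopped at "cat" without consuming it.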

To take things back to your original examples we could write it something like this:

const wordBoundary = '[\\s/?.!]';

var re = new RegExp(`(^|${wordBoundary})(Liebe|Grüße|Ärztin)(?=${wordBoundary}|$)`, 'g');

console.log(re);

// Test cases
[
    'Liebe Grüße',
    'Liebe asGrüße Liebe Grüße Ärztin Grüße  bd',
    'Liebe GrüßeLiebe Grüße Ärztin Grüße  bd',
    'Liebe Grüßeas Liebe Grüße Ärztin Grüße  bd',
    'Liebe as Grüße Liebe Grüße Ärztin Grüße  bd',
    'Liebe Ärztin Grüße',
    'Liebe\nGrüße',
    'Liebe\tGrüße',
    'Liebe?Grüße',
    'Liebe.Grüße',
    'Liebe!Grüße',
    'Liebe/Grüße',
    'Liebe\\Grüße'
].forEach(function(str) {
    console.log(str.replace(re, '$1$2'));
});

While I have changed the way the word boundary is written in that example it should be noted that writing it exactly the way it was written in the question would also have worked fine.
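If the word list is not fixed, the same pattern could be wrapped in a small helper. The sketch below is one possible shape, not the only one: highlightWords and its open/close parameters are hypothetical names, and the square brackets in the usage line are just a stand-in for whatever markup you actually wrap the words in.

```javascript
// Hypothetical helper: wraps each listed word in caller-supplied open/close markers.
function highlightWords(text, words, open, close) {
    // Escape regex metacharacters so every word is matched literally.
    const escapedWords = words.map(w => w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
    const boundary = "[\\s/?.!]";
    const re = new RegExp(
        `(^|${boundary})(${escapedWords.join("|")})(?=${boundary}|$)`,
        "g"
    );
    // `before` puts the leading boundary back; the lookahead never consumed the trailing one.
    return text.replace(re, (match, before, word) => `${before}${open}${word}${close}`);
}

const highlighted = highlightWords("Liebe Grüße", ["Liebe", "Grüße", "Ärztin"], "[", "]");
console.log(highlighted); // [Liebe] [Grüße]

const embedded = highlightWords("Liebe asGrüße", ["Liebe", "Grüße"], "[", "]");
console.log(embedded); // [Liebe] asGrüße
```

A replacer function is used instead of a replacement string so that a $ inside the markers is never interpreted as a group reference.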

This leaves one open question: why did the extra escaping appear to work? Here's a simpler example to help demonstrate that:

// This is the same as:
// var re = new RegExp('(\\\\?)(Liebe|Grüße)(\\\\?)', 'g');

var re = /(\\?)(Liebe|Grüße)(\\?)/g;

console.log("Liebe Grüße".replace(re, `$1$2$3`));

console.log("LiebeXX Grüße".replace(re, `$1$2$3`));

console.log("Liebe\\Grüße".replace(re, `$1$2$3`));

I've stripped out most of the word boundary and just left in the key part of the alternation, \\?. The double backslash is an escape sequence for a single backslash, and the ? is being treated as the 'optional' quantifier. So this matches an optional \. In other words, the word boundary will quite happily match an empty string. Effectively it just ignores the word boundary altogether, unless that boundary is a \ character.

When you're creating a RegExp using a string you need to escape the backslashes an extra time (once for the string literal, once for the RegExp). However, you were already doing that in your original example. By escaping them another time (so that you have 4 backslashes) you just end up with the 'match an optional backslash' scenario.
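The escaping levels can be compared directly by inspecting the source property of each RegExp. This is a rough sketch with illustrative variable names:

```javascript
// Three ways of writing the pattern, compared via .source.
const fromString = new RegExp("(\\?)");   // string holds (\?)  -> a literal "?"
const doubled = new RegExp("(\\\\?)");    // string holds (\\?) -> an optional backslash
const literal = /(\\?)/;                  // same pattern as `doubled`

console.log(fromString.source); // (\?)
console.log(doubled.source);    // (\\?)
console.log(literal.source);    // (\\?)

// An optional backslash is happy to match nothing at all.
const emptyMatch = literal.exec("Liebe")[0];
console.log(emptyMatch === ""); // true

// The correctly escaped version needs a real question mark.
console.log(fromString.test("Liebe"));  // false
console.log(fromString.test("Liebe?")); // true
```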
