Reputation: 5856
My goal is to highlight feminine nouns (German) by wrapping them in a <span> tag with a specific class="..." style.
As I'm dealing with a non-ASCII character set, I (unfortunately) cannot use the "word boundary" \b in JavaScript's RegEx, so I'm forced to improvise by explicitly listing what I consider to be a word boundary.
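To illustrate the limitation, here is a minimal sketch (the sample strings are made up for this demonstration). In JavaScript, \b is defined in terms of \w, which covers only [A-Za-z0-9_], so umlauts and ß count as non-word characters:

```javascript
// \b sits between a \w character ([A-Za-z0-9_]) and a non-\w character.
// "Ä", "ü" and "ß" are not \w, so boundaries land in the wrong places.

// No boundary is seen between " " and "Ä", so the word is missed.
const missesUmlautStart = /\bÄrztin\b/.test("die Ärztin");

// A boundary is seen between "r" and "ü", so a fragment of a word matches.
const splitsInsideWord = /\bGr\b/.test("Grüße");

console.log(missesUmlautStart, splitsInsideWord); // false true
```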
My code (simplified and streamlined) looks like the following:
const wordBoundary = "(^|\\s|$|/|\\?|\\.|\\!|\\ )";
"Liebe Grüße".replace(
new RegExp(`${wordBoundary}(Liebe|Grüße)${wordBoundary}`, "g"),
`<span class="nounF">$1$2$3</span>`
);
However, this only highlights the first word and not the second, producing:
<span class="nounF">Liebe </span>Grüße
Debugging in the console I (pretty much by accident) found out that if I use a regex initializer instead of a RegExp object, everything works as expected, producing <span class="nounF">Liebe</span> <span class="nounF">Grüße</span>:
"Liebe Grüße".replace(
/(^|\\s|$|\/|\\?|\\.|\\!|\\ )(Liebe|Grüße)(^|\\s|$|\/|\\?|\\.|\\!|\\ )/g,
`<span class="nounF">$1$2$3</span>`
);
My question is two-fold:
Why does the result differ between using a RegExp object and not using an in-place regex initializer? Because that looks like a bug to me, TBH.
How should I write wordBoundary for it?
Upvotes: 2
Views: 61
Reputation: 170
You have to double the backslashes:
const wordBoundary = "(^|\\\\s|$|/|\\\\?|\\\\.|\\\\!|\\\\ )";
This is because (in your scenario) the variable wordBoundary contains correctly escaped backslashes (\\), but when you reuse that variable again in ${...} you lose the escaping (all the \\ have become \ and now you escape other characters). A RegExp literal completely avoids this problem.
EDIT: this is completely wrong, but if you are reading this and still don't know the correct answer, take a minute and think about why it is wrong.
Upvotes: 0
Reputation: 29092
First let's consider your word boundary:
const wordBoundary = "(^|\\s|$|/|\\?|\\.|\\!|\\ )";
Contrary to what has been asserted elsewhere, this is correctly escaped. It isn't necessarily the best way to write it, but it will work. The |\\ ) for the space at the end isn't necessary, as a space is already covered by the \\s. You also don't need to escape the !, but it won't hurt.
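A minimal sketch to check the escaping claim yourself. The variable names interpolated and boundaryOnly are introduced here only for illustration; it shows that the string holds single backslashes, that interpolation does not process escapes a second time, and that \s already covers a plain space:

```javascript
const wordBoundary = "(^|\\s|$|/|\\?|\\.|\\!|\\ )";

// The string literal "\\s" holds two characters: a backslash and an "s".
console.log(wordBoundary); // (^|\s|$|/|\?|\.|\!|\ )

// Interpolating into a template literal copies the value unchanged;
// escape sequences are not processed a second time.
const interpolated = `${wordBoundary}`;
console.log(interpolated === wordBoundary); // true

// The resulting pattern treats \s as whitespace, which already covers a space.
const boundaryOnly = new RegExp(`^${wordBoundary}$`);
console.log(
  boundaryOnly.test(" "),  // true
  boundaryOnly.test("\t"), // true
  boundaryOnly.test("x")   // false
);
```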
Let's consider a similar example that just uses ASCII:
const wordBoundary = "(^|\\s|$|/|\\?|\\.|\\!|\\ )";
console.log(
"cat dog".match(new RegExp(`${wordBoundary}(cat|dog)${wordBoundary}`, 'g'))
);
Notice that it only matches cat and not dog. Or to be more precise, it matches 'cat ', with a space at the end. This is the key. The space has already been matched, so you can't match it again when attempting to match dog. Matches cannot overlap. To avoid this problem you'd use a positive lookahead to ensure the space isn't consumed:
const wordBoundary = "(^|\\s|$|/|\\?|\\.|\\!|\\ )";
console.log(
"cat dog".match(new RegExp(`${wordBoundary}(cat|dog)(?=${wordBoundary})`, 'g'))
);
Better, now it's matching both cat and dog. Notice how the space is now at the start of ' dog' because it is part of the second match and not part of the first.
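One way to see the difference is to print the character range each match consumes. This is only a sketch: consumedRanges is a hypothetical helper written for this illustration, and it relies on String.prototype.matchAll.

```javascript
const wordBoundary = "(^|\\s|$|/|\\?|\\.|\\!|\\ )";
const consuming = new RegExp(`${wordBoundary}(cat|dog)${wordBoundary}`, "g");
const withLookahead = new RegExp(`${wordBoundary}(cat|dog)(?=${wordBoundary})`, "g");

// Hypothetical helper: list the [start, end) range each match consumes.
function consumedRanges(re, text) {
  return [...text.matchAll(re)].map((m) => [m.index, m.index + m[0].length]);
}

// The trailing space (index 3) is consumed, so "dog" has no boundary left to match.
console.log(consumedRanges(consuming, "cat dog"));     // ranges: [[0, 4]]

// The lookahead only peeks at the space, so the next match can start on it.
console.log(consumedRanges(withLookahead, "cat dog")); // ranges: [[0, 3], [3, 7]]
```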
To take things back to your original examples we could write it something like this:
const wordBoundary = '[\\s/?.!]';
var re = new RegExp(`(^|${wordBoundary})(Liebe|Grüße|Ärztin)(?=${wordBoundary}|$)`, 'g');
console.log(re);
// Test cases
[
'Liebe Grüße',
'Liebe asGrüße Liebe Grüße Ärztin Grüße bd',
'Liebe GrüßeLiebe Grüße Ärztin Grüße bd',
'Liebe Grüßeas Liebe Grüße Ärztin Grüße bd',
'Liebe as Grüße Liebe Grüße Ärztin Grüße bd',
'Liebe Ärztin Grüße',
'Liebe\nGrüße',
'Liebe\tGrüße',
'Liebe?Grüße',
'Liebe.Grüße',
'Liebe!Grüße',
'Liebe/Grüße',
'Liebe\\Grüße'
].forEach(function(str) {
console.log(str.replace(re, '$1<b>$2</b>'));
});
Although I changed how the word boundary is written in that example, writing it exactly as it was written in the question would also have worked fine.
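As an aside, a different technique from the one above is to avoid listing boundary characters at all and instead require that no letter or digit touches the noun. The sketch below assumes a runtime with lookbehind and Unicode property escapes (ES2018 or later) and assumes the nouns contain no regex metacharacters. The names nouns, letterOrDigit, nounRe and highlightNouns are hypothetical; only the class name nounF comes from the question. Note that it treats any non-letter, non-digit character (a comma, for example) as a boundary, which is broader than the explicit list.

```javascript
// Assumption: lookbehind and \p{...} are available (ES2018 and later).
const nouns = ["Liebe", "Grüße", "Ärztin"]; // sample list from the test cases above
const letterOrDigit = "[\\p{L}\\p{N}_]";

// Nothing is consumed around the noun, so adjacent nouns cannot block each other.
const nounRe = new RegExp(
  `(?<!${letterOrDigit})(${nouns.join("|")})(?!${letterOrDigit})`,
  "gu"
);

// Hypothetical helper wrapping each standalone noun in the span from the question.
function highlightNouns(text) {
  return text.replace(nounRe, '<span class="nounF">$1</span>');
}

console.log(highlightNouns("Liebe Grüße"));
// <span class="nounF">Liebe</span> <span class="nounF">Grüße</span>
console.log(highlightNouns("Liebe asGrüße Grüßeas Ärztin."));
// <span class="nounF">Liebe</span> asGrüße Grüßeas <span class="nounF">Ärztin</span>.
```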
This leaves one open question: why did the extra escaping appear to work? Here's a simpler example to help demonstrate that:
// This is the same as:
// var re = new RegExp('(\\\\?)(Liebe|Grüße)(\\\\?)', 'g');
var re = /(\\?)(Liebe|Grüße)(\\?)/g;
console.log("Liebe Grüße".replace(re, `<b>$1$2$3</b>`));
console.log("LiebeXX Grüße".replace(re, `<b>$1$2$3</b>`));
console.log("Liebe\\Grüße".replace(re, `<b>$1$2$3</b>`));
I've stripped out most of the word boundary and just left in the key part of the alternation, \\?. The double backslashes are an escape sequence for a single backslash, and the ? is treated as the 'optional' quantifier. So this matches an optional \. In other words, the word boundary will quite happily match an empty string. Effectively it just ignores the word boundary altogether, unless that boundary is a \ character.
When you create a RegExp from a string you need to escape the backslashes an extra time (once for the string literal, once for the RegExp). However, you were already doing that in your original example. By escaping them yet again (so that you have 4 backslashes) you just end up with the 'match an optional backslash' scenario.
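A short sketch of that equivalence. The variable names fromString and fromLiteral are made up for this illustration, and "LiebeXX" is the same sample input used earlier:

```javascript
// Four backslashes in a string literal become two in the pattern: an escaped backslash.
const fromString = new RegExp("(\\\\?)(Liebe|Grüße)(\\\\?)", "g");
const fromLiteral = /(\\?)(Liebe|Grüße)(\\?)/g;
console.log(fromString.source === fromLiteral.source); // true

// As a pattern, \\? means "an optional backslash", so it also matches the empty string.
const optionalBackslash = /^\\?$/;
console.log(optionalBackslash.test(""));   // true
console.log(optionalBackslash.test("\\")); // true

// With an effectively optional boundary, the noun is matched inside longer words too.
const falsePositive = "LiebeXX".replace(fromString, "<b>$1$2$3</b>");
console.log(falsePositive); // <b>Liebe</b>XX
```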
Upvotes: 2