Bhupen

Reputation: 41

Regex using word boundaries for matching alphanumeric and non-alphanumeric characters in JavaScript

I am trying to highlight a set of keywords using JavaScript and regex, but I am facing one problem: a keyword may contain both literal and special characters, as in @text, #number, etc. I am using word boundaries so that only whole words are matched and replaced, not partial words (contained within another word).

var pattern = new RegExp('\\b(' + keyword + ')\\b', 'gi');

This expression matches whole keywords and highlights them; however, a keyword like "number:" does not get highlighted.

I am aware that \bword\b matches at word boundaries, and that special characters are non-alphanumeric and hence are not matched by the above expression. What regex can I use to accomplish the above?

==Update==

For the above, I tried Tim Pietzcker's suggestion of the regex below:

expr: (?:^|\\b|\\s)(" + keyword + ")(?:$|\\b|\\s)

The above seems to work for matching the whole word with alphanumeric and non-alphanumeric characters. However, whenever an HTML tag immediately precedes or follows the keyword without a space, the keyword is not highlighted (e.g. social security *number:< br >*). I tried the following regex, but it also replaces the HTML tag next to the keyword:

expr: (?:^|\b|\s|<[^>]+>)number:(?:$|\b|\s|<[^>]+>) 

Here, for the keyword number:, the < br > that immediately follows it without a space (space added intentionally inside the br tag to keep the browser from interpreting it) gets highlighted along with the keyword.

Can you suggest an expression that would ignore an adjacent HTML tag when matching whole words with both alphanumeric and non-alphanumeric characters?

Upvotes: 4

Views: 3019

Answers (6)

Tim Pietzcker

Reputation: 336478

2021 update: JS now supports lookbehind so this answer is a little outdated.

OK, so you have two problems: JavaScript doesn't support lookbehind, and \b only finds boundaries between alphanumeric and non-alphanumeric characters.

The first question: What exactly does constitute a word boundary for your keywords? My guess is that it must be either a \b boundary or whitespace. If that is the case, you could search for

"(?:^|\\b|\\s)(" + keyword + ")(?:$|\\b|\\s)"

Of course, the whitespace characters around keywords like @number# would also become part of the match, but perhaps highlighting those isn't such a problem. In other cases, i.e. if there is an actual word boundary that can match, the spaces won't be part of the match, so it should work fine in the majority of cases.

The actual word you're interested in will be in backreference #1, so if you can highlight that separately, even better.
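A minimal sketch of highlighting only backreference #1, assuming the keyword should be matched as literal text. The `escapeRegExp` helper, the `highlight` function name, and the `<mark>` wrapper are illustrative assumptions, not part of the suggestion above. The two boundary groups are captured here (rather than non-capturing) so that any whitespace they consumed can be put back around the highlighted keyword:

```javascript
// Hypothetical helper: escape regex metacharacters so the keyword is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical function: wrap each whole-keyword match in <mark>...</mark>.
function highlight(text, keyword) {
  const re = new RegExp(
    "(^|\\b|\\s)(" + escapeRegExp(keyword) + ")($|\\b|\\s)",
    "gi"
  );
  // `before` and `after` hold whatever the boundary groups consumed
  // (an empty string or one whitespace character); only `word` is wrapped.
  return text.replace(re, (match, before, word, after) =>
    before + "<mark>" + word + "</mark>" + after
  );
}

console.log(highlight("social security number: 123", "number:"));
// → social security <mark>number:</mark> 123
```

One limitation of this pattern: a whitespace character consumed at the end of one match cannot also open the next one, so in "@number# @number#" only the first occurrence would be highlighted.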

EDIT: If characters other than spaces may occur before/after a keyword, then I think the only thing you can do (if you're stuck with JavaScript) is:

  1. Check if your keyword starts with an alnum character.
  2. If so, prepend \b to your regex.
  3. Check if your keyword ends with an alnum character.
  4. If so, append \b to your regex.

So, for keyword, use \bkeyword\b; for number:, use \bnumber:; for @twitter, use @twitter\b.
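The four steps above can be sketched roughly as follows. The function names `escapeRegExp` and `keywordRegex` are made up for illustration, and the sketch assumes that "alnum" can be tested with `\w` (which also accepts the underscore, the same definition `\b` itself uses):

```javascript
// Hypothetical helper: escape regex metacharacters so the keyword is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical function: add \b only on the sides where the keyword
// starts or ends with a word character.
function keywordRegex(keyword) {
  const prefix = /^\w/.test(keyword) ? "\\b" : "";
  const suffix = /\w$/.test(keyword) ? "\\b" : "";
  return new RegExp(prefix + "(" + escapeRegExp(keyword) + ")" + suffix, "gi");
}

const html = "social security number:<br>";
console.log(html.replace(keywordRegex("number:"), "<mark>$1</mark>"));
// → social security <mark>number:</mark><br>
```

A side that gets no `\b` is left unanchored, so with this approach @twitter would also match inside user@twitter.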

Upvotes: 2

sumair

Reputation: 1

Try this; it should work:

var pattern = new RegExp("\\b" + keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&") + "\\b", "gi");

Upvotes: 0

tchrist

Reputation: 80443

As Tim correctly points out, \b is a tricky thing that works differently from the way people often think it works. Read this answer for more details about this matter, and what you can do about it.

In brief, this is a boundary to the left:

(?(?=\w)(?<!\w)|(?<!\W))

and this is a boundary to the right:

(?(?<=\w)(?!\w)|(?!\W))

People always think there are spaces involved, but there aren’t. However, now that you know the real definitions, it’s easy to build that into them. One could swap out \w and \W in exchange for \s and \S in the two patterns above. Or one could add whitespace awareness to the else blocks.
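JavaScript regexes have no `(?(cond)yes|no)` conditional syntax, so the two patterns cannot be used there verbatim. As a rough sketch, assuming an engine with lookbehind support, each conditional can be spelled out as an alternation of "condition holds, then yes-branch" and "condition fails, then else-branch". The version below applies the \s / \S swap described above; the names `LEFT`, `RIGHT`, `escapeRegExp`, and `wholeKeywordRegex` are illustrative only:

```javascript
// Hypothetical helper: escape regex metacharacters so the keyword is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// (?(?=\s)(?<!\s)|(?<!\S))  rewritten without a conditional
const LEFT = "(?:(?=\\s)(?<!\\s)|(?!\\s)(?<!\\S))";
// (?(?<=\s)(?!\s)|(?!\S))   rewritten without a conditional
const RIGHT = "(?:(?<=\\s)(?!\\s)|(?<!\\s)(?!\\S))";

// Hypothetical function: match the keyword only when it is delimited by
// whitespace or by the start/end of the string.
function wholeKeywordRegex(keyword) {
  return new RegExp(LEFT + escapeRegExp(keyword) + RIGHT, "gi");
}

const text = "social security number: 123";
console.log(text.replace(wholeKeywordRegex("number:"), "[$&]"));
// → social security [number:] 123
```

For a keyword that starts and ends with a non-whitespace character, these two boundaries reduce to `(?<!\S)` on the left and `(?!\S)` on the right.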

Upvotes: 0

PleaseStand

Reputation: 32112

We need to look for a substring that has a whitespace character on both sides. If JavaScript supported lookbehind, this would look like:

var re = new RegExp('(?<!\\S)' + keyword + '(?!\\S)', 'gi');

That won't work though (but would in Perl and other scripting languages). Instead, we need to include the leading whitespace character (or beginning of string) as the beginning part of the match (and optionally capture what we are really looking for into $1):

var re = new RegExp('(?:^|\\s)(' + keyword + ')(?!\\S)', 'gi');

Just keep in mind that whenever the match begins with a whitespace character (rather than at the beginning of the string), the real place where the keyword starts is one character after the .index property of the result returned by re.exec(string), and that if you are accessing the matched string, you either need to remove that first character with .slice(1) or simply access what is captured.
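A small sketch of that bookkeeping, assuming the keyword should be matched literally (`escapeRegExp` and `findKeyword` are hypothetical names). Instead of always adding 1, it derives the offset from the difference in length between the full match and the captured group, which is 0 when the match began at the start of the string and 1 when a whitespace character was consumed:

```javascript
// Hypothetical helper: escape regex metacharacters so the keyword is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical function: return the true start offset and text of every
// whole-keyword occurrence, without relying on lookbehind.
function findKeyword(text, keyword) {
  if (keyword === "") return []; // avoid zero-length matches looping forever
  const re = new RegExp("(?:^|\\s)(" + escapeRegExp(keyword) + ")(?!\\S)", "gi");
  const hits = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    // m[0] may carry one leading whitespace character; m[1] is the keyword itself.
    const start = m.index + (m[0].length - m[1].length);
    hits.push({ start: start, text: m[1] });
  }
  return hits;
}

console.log(findKeyword("number: and @text number:", "number:"));
```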

Upvotes: 1

Nathan

Reputation: 11159

Lookahead and lookbehind are your answer: "(?<=\\s|^)" + keyword + "(?=\\s|$)". The bits in brackets aren't included in the match, so put in there whatever characters aren't permitted in the keywords.

Upvotes: 0

fcalderan

Reputation:

Maybe what you're trying to do is:

'\\b\\W*(' + keyword + ')\\W*\\b'

Upvotes: 0
