Harish Shetty

Reputation: 64363

Handling word boundaries in regular expressions

In the string below, I am trying to match the standalone occurrences of Inc.

Inc. aa Inc. bbbInc. Inc.

The following regular expressions didn't work:

/\bInc\.\b/       # got zero matches
/\bInc\.(\b|$)/   # matched the last Inc.

I think this is because \b matches boundaries between word and non-word characters, whereas here the \b comes after \., which is a non-word character. I tweaked the expressions to make them work:

/\bInc\.($|\W)/
/\bInc\.\B/

Upvotes: 1

Views: 75

Answers (2)

sawa

Reputation: 168101

You want to match "Inc." followed by a non-word character. Since "." is itself a non-word character, what you expect at the trailing boundary is a \W\W sequence (or the end of the string). \b only matches at the boundary of a \w\W or \W\w sequence, so it can never match at the position you are looking for.

The fourth expression works because \B matches wherever \b does not: in the middle of a \w\w sequence or a \W\W sequence, with the beginning and the end of the string counted as \W. Since "." matches \W, the \.\B match is narrowed down to \W\W (or the end of the string), which is what you wanted.
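
To check this against the sample string, here is a minimal sketch in Ruby (Ruby is an assumption based on the regex literal syntax in the question; the constant `SAMPLE` and the helper `match_offsets` are names made up for illustration). It collects the start offset of every match for the first and the fourth expressions.

```ruby
# Sample string from the question.
SAMPLE = "Inc. aa Inc. bbbInc. Inc."

# Hypothetical helper: start offset of every match of `pattern` in `str`.
def match_offsets(str, pattern)
  offsets = []
  str.scan(pattern) { offsets << Regexp.last_match.begin(0) }
  offsets
end

# First expression: \b after "." needs a word character next, so nothing matches.
p match_offsets(SAMPLE, /\bInc\.\b/)   # => []

# Fourth expression: \B after "." needs a non-word character or the end of the string.
p match_offsets(SAMPLE, /\bInc\.\B/)   # => [0, 8, 21]
```

The Inc. inside bbbInc. (offset 16) is skipped by both expressions because the leading \b fails between "b" and "I".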

Comparing the third and the fourth expressions, the third one has two problems. (1) Notice that \W consumes a character. So /\bInc\.($|\W)/ will include within the match the character that follows the part you want. To avoid this, you can use a lookahead: /\bInc\.(?=$|\W)/, but compared to that, the fourth one is much better. (2) Although it is not a problem with your particular example, when the string spans more than one line, $ can match at the end of a line rather than only at the end of the string. \z, which matches only at the very end of the string, is better.

I cannot think of one better than your fourth expression.
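
Both points can be seen in a short sketch, again assuming Ruby. The constants `SAMPLE` and `TWO_LINES` and the helper `matched_texts` are illustrative names, and the two-line string is invented for the demonstration. The last two lines use simplified patterns to isolate the difference between $ and \z.

```ruby
# Sample string from the question.
SAMPLE = "Inc. aa Inc. bbbInc. Inc."

# Hypothetical helper: the full text of every match (not the capture groups).
def matched_texts(str, pattern)
  texts = []
  str.scan(pattern) { texts << Regexp.last_match[0] }
  texts
end

# (1) \W consumes the following character; the lookahead only checks it.
p matched_texts(SAMPLE, /\bInc\.($|\W)/)    # => ["Inc. ", "Inc. ", "Inc."]
p matched_texts(SAMPLE, /\bInc\.(?=$|\W)/)  # => ["Inc.", "Inc.", "Inc."]

# (2) In Ruby, $ matches at the end of every line, \z only at the end of the string.
TWO_LINES = "Acme Inc.\nsubsidiary"         # made-up input
p TWO_LINES.match?(/\bInc\.$/)              # => true
p TWO_LINES.match?(/\bInc\.\z/)             # => false
```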

Upvotes: 2

Ben

Reputation: 13635

From the Perl regex documentation:

A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.

Since \w represents [a-zA-Z0-9_], "." is a non-word character, so \b won't match between the "." and a following non-word character (or the end of the string), as you correctly assumed.

\bInc\.\B

Will match the first four characters of Inc.., or an Inc. followed by any other non-\w character (or by the end of the string). The same goes for

\bInc\.($|\W)

If you want to match Inc. only when it is followed by whitespace (including a newline) or the end of the line, I'd use

\bInc\.(\s|$)
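
As a rough illustration of how this differs from the \B version, here is a small Ruby sketch (Ruby is assumed; the constant `PUNCTUATED`, its contents, and the helper `all_matches` are invented for the example). A non-capturing group is used so that `scan` returns the whole match.

```ruby
# Made-up input with Inc. followed by a comma, a space, and another dot.
PUNCTUATED = "Acme Inc., Beta Inc. and Gamma Inc.."

# Hypothetical helper: every full match of `pattern` in `str`.
def all_matches(str, pattern)
  str.scan(pattern)
end

# \B accepts any non-word character (or the end of the string) after the dot.
p all_matches(PUNCTUATED, /\bInc\.\B/)         # => ["Inc.", "Inc.", "Inc."]

# (\s|$) accepts only whitespace or a line end, and the whitespace is consumed.
p all_matches(PUNCTUATED, /\bInc\.(?:\s|$)/)   # => ["Inc. "]
```

Which one is preferable depends on whether Inc. followed by punctuation should count as a match.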

Upvotes: 0
