CroweMan

Reputation: 343

Regex word boundary expressions

Say for example I have the following string "one two(three) (three) four five" and I want to replace "(three)" with "(four)" but not within words. How would I do it?

Basically I want to do a regex replace and end up with the following string:

"one two(three) (four) four five"

I have tried the following regex but it doesn't work:

@"\b\(three\)\b"

Basically I am writing some search and replace code and am giving the user the usual options to match case, match whole word etc. In this instance the user has chosen to match whole words but I don't know what the text being searched for will be.

Upvotes: 34

Views: 42839

Answers (6)

sarh

Reputation: 6647

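A simple reason the word boundary did not work in my case: my regex was defined as "\bFOO\b" rather than @"\bFOO\b" or "\\bFOO\\b". In a regular C# string literal, \b is the backspace character, so the regex engine never sees a word-boundary token.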
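To make the difference visible, here is a minimal sketch written as C# 9 top-level statements. The sample text "a FOO b" and the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// In a regular string literal, \b is the backspace character (U+0008),
// so the regex engine receives <BS>FOO<BS> instead of a word-boundary token.
string regular = "\bFOO\b";
string verbatim = @"\bFOO\b";   // backslashes are preserved
string escaped = "\\bFOO\\b";   // same seven characters as the verbatim form

bool regularMatches = Regex.IsMatch("a FOO b", regular);
bool verbatimMatches = Regex.IsMatch("a FOO b", verbatim);

Console.WriteLine($"{regular.Length} {verbatim.Length} {verbatim == escaped}"); // 5 7 True
Console.WriteLine($"{regularMatches} {verbatimMatches}");                       // False True
```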

Upvotes: 0

Wiktor Stribiżew

Reputation: 627468

See what a word boundary matches:

A word boundary can occur in one of three positions:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

So, your \b\(three\)\b regex DOES work, but NOT the way you expected. It does not match (three) in "In (three) years", "In(three) years" or "In (three)years", but it does match in "In(three)years" because there are word boundaries between "n" and "(" and between ")" and "y".
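The four cases above can be checked with a short sketch (C# 9 top-level statements; the variable names are illustrative only):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string pattern = @"\b\(three\)\b";
string[] samples =
{
    "In (three) years",
    "In(three) years",
    "In (three)years",
    "In(three)years",
};

// \b needs a word character on exactly one side of the position,
// so only the sample with letters touching both parentheses matches.
bool[] results = samples.Select(s => Regex.IsMatch(s, pattern)).ToArray();

for (int i = 0; i < samples.Length; i++)
    Console.WriteLine($"{samples[i]} -> {results[i]}");
// Only the last sample, "In(three)years", reports True.
```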

What you can do in these situations is use dynamic adaptive word boundaries: constructs that require a word boundary only where one is expected (see my "Dynamic adaptive word boundaries" YT video for a better visual understanding of these constructs).

In C#, it can be written as

@"(?!\B\w)\(three\)(?<!\w\B)"

In short:

  • (?!\B\w) - only require a word boundary on the left if the char that follows the word boundary is a word char
  • \(three\)
  • (?<!\w\B) - only require a word boundary on the right if the char that precedes the word boundary is a word char.
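As a rough sketch of how these two lookarounds behave, the snippet below wraps them in a small helper. The name AdaptiveWholeWord is hypothetical and not part of .NET, and the sample strings are only for illustration (C# 9 top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: escapes the phrase and wraps it in the adaptive boundaries.
string AdaptiveWholeWord(string phrase) =>
    @"(?!\B\w)" + Regex.Escape(phrase) + @"(?<!\w\B)";

// The phrase starts and ends with a word character, so a boundary is required
// on both sides: "threes" is skipped, "(three)" is still found.
string a = Regex.Replace("three threes (three)", AdaptiveWholeWord("three"), "four");

// The phrase starts and ends with a non-word character, so no boundary is
// required on either side and both occurrences are replaced.
string b = Regex.Replace("one two(three) (three) four five",
                         AdaptiveWholeWord("(three)"), "(four)");

Console.WriteLine(a); // four threes (four)
Console.WriteLine(b); // one two(four) (four) four five
```

Note that with a phrase like (three) this also replaces the occurrence in two(three). If that occurrence must be left alone, the phrase has to be delimited by whitespace rather than by word boundaries.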

If your search phrases can contain whitespace and you need to match the longer alternatives first, you can build the pattern dynamically from a list, like

var phrases = new List<string> { @"(one)", @".two.", "[three]" };
phrases = phrases.OrderByDescending(x => x.Length).ToList();
var pattern = $@"(?!\B\w)(?:{string.Join("|", phrases.Select(z => Regex.Escape(z)))})(?<!\w\B)";

with the resulting pattern like (?!\B\w)(?:\[three]|\(one\)|\.two\.)(?<!\w\B) that matches what you'd expect, see the C# demo and the regex demo.

Upvotes: 2

Dominique Terrs

Reputation: 629

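Here is a simple snippet you may be interested in:

    // Escape the search text so characters such as ( or . are matched literally.
    string pattern = @"\b" + Regex.Escape(find) + @"\b";
    string result = Regex.Replace(stringToSearch, pattern, replace, RegexOptions.IgnoreCase);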

Source code: snip2code - C#: Replace an exact word in a sentence

Upvotes: 9

jongala

Reputation: 527

I recently came across a similar issue in javascript trying to match terms with a leading '$' character only as separate words, e.g. if $hot = 'FUZZ', then:

"some $hot $hotel bird$hot pellets" ---> "some FUZZ $hotel bird$hot pellets"

The regex /\b\$hot\b/g (my first guess) did not work for the same reason the parens did not match in the original question: '$' is a non-word character, so when it is preceded by whitespace or the start of the string there is no word/non-word boundary in front of it.

However, the regex /\B\$hot\b/g does match, which shows that the positions not marked in @timwi's excellent example match the \B term. This was not intuitive to me because ") (" is not made of regex word characters. But since \B is simply the inversion of \b, it does not need word characters on either side; it matches at every position that is not a word boundary, including between two non-word characters :)
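For readers following the C# side of this thread, here is the same experiment ported to .NET as a sketch (C# 9 top-level statements). It assumes that \b and \B behave the same way in both engines for this ASCII-only input.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "some $hot $hotel bird$hot pellets";

// \b before '$' only succeeds when a word character precedes it,
// so this replaces the wrong occurrence (the one glued to "bird").
string withBoundary = Regex.Replace(text, @"\b\$hot\b", "FUZZ");

// \B before '$' succeeds after whitespace (or at the start of the string).
string withNonBoundary = Regex.Replace(text, @"\B\$hot\b", "FUZZ");

Console.WriteLine(withBoundary);    // some $hot $hotel birdFUZZ pellets
Console.WriteLine(withNonBoundary); // some FUZZ $hotel bird$hot pellets
```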

Upvotes: 0

Timwi

Reputation: 66604

Your problem stems from a misunderstanding of what \b actually means. Admittedly, it is not obvious.

The reason \b\(three\)\b doesn’t match the threes in your input string is the following:

  • \b means: the boundary between a word character and a non-word character.
  • Letters (e.g. a-z) are considered word characters.
  • Punctuation marks such as ( are considered non-word characters.

Here is your input string again, stretched out a bit, and I’ve marked the places where \b matches:

 o n e   t w o ( t h r e e )   ( t h r e e )   f o u r   f i v e
↑     ↑ ↑     ↑ ↑         ↑     ↑         ↑   ↑       ↑ ↑       ↑

As you can see here, there is a \b between “two” and “(three)”, but not before the second “(three)”.
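The marked positions can also be listed programmatically. This is a small sketch (C# 9 top-level statements) that collects the index of every empty \b match; the variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "one two(three) (three) four five";

// \b is zero-width, so every match is empty and its Index is the boundary position.
int[] boundaries = Regex.Matches(input, @"\b")
                        .Cast<Match>()
                        .Select(m => m.Index)
                        .ToArray();

Console.WriteLine(string.Join(",", boundaries));
// 0,3,4,7,8,13,16,21,23,27,28,32
```

Index 7 (just before the first parenthesis) is in the list, while index 15 (just before the second one) is not.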

The moral of the story? “Whole-word search” doesn’t really make much sense if what you’re searching for is not just a word (a string of letters). Since you have punctuation characters (parentheses) in your search string, it is not as such a “word”. If you searched for a word consisting only of word characters, then \b would do what you expect.

You can, of course, use a different regex to match the string only if it is surrounded by whitespace or occurs at the beginning or end of the string:

(^|\s)\(three\)(\s|$)

However, the problem with this is, of course, that if you search for “three” (without the parentheses), it won’t find the one in “(three)” because it doesn’t have spaces around it, even though it is actually a whole word.
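Here is a sketch of how the whitespace-delimited pattern could be used in a replacement (C# 9 top-level statements). The second variant, built from the lookarounds (?<!\S) and (?!\S), is not from the answer above; it is a swapped-in alternative that tests for the same whitespace-or-edge condition without consuming the whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "one two(three) (three) four five";

// Capturing variant: the surrounding whitespace is matched, so it has to be
// put back through $1 and $2 in the replacement.
string viaGroups = Regex.Replace(input, @"(^|\s)\(three\)(\s|$)", "$1(four)$2");

// Lookaround variant (assumption, see above): nothing around the phrase is consumed.
string viaLookarounds = Regex.Replace(input, @"(?<!\S)\(three\)(?!\S)", "(four)");

Console.WriteLine(viaGroups);      // one two(three) (four) four five
Console.WriteLine(viaLookarounds); // one two(three) (four) four five

// The two differ when matches share a separator: the capturing variant
// consumes the space, so the second occurrence is no longer preceded by one.
string adjacentGroups = Regex.Replace("(three) (three)", @"(^|\s)\(three\)(\s|$)", "$1(four)$2");
string adjacentLookarounds = Regex.Replace("(three) (three)", @"(?<!\S)\(three\)(?!\S)", "(four)");

Console.WriteLine(adjacentGroups);      // (four) (three)
Console.WriteLine(adjacentLookarounds); // (four) (four)
```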

I think most text editors (including Visual Studio) will use \b only if your search string actually starts and/or ends with a word character:

var pattern = Regex.Escape(searchString);
if (Regex.IsMatch(searchString, @"^\w"))
    pattern = @"\b" + pattern;
if (Regex.IsMatch(searchString, @"\w$"))
    pattern = pattern + @"\b";

That way they will find “(three)” even if you select “whole words only”.
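Wrapped in a function and applied to the string from the question, the logic above might look like this sketch (C# 9 top-level statements). The name BuildWholeWordPattern is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: adds \b only on the sides where the search string
// starts or ends with a word character.
string BuildWholeWordPattern(string searchString)
{
    string pattern = Regex.Escape(searchString);
    if (Regex.IsMatch(searchString, @"^\w"))
        pattern = @"\b" + pattern;
    if (Regex.IsMatch(searchString, @"\w$"))
        pattern += @"\b";
    return pattern;
}

string input = "one two(three) (three) four five";

// No \b is added for "(three)", so both occurrences are found.
string r1 = Regex.Replace(input, BuildWholeWordPattern("(three)"), "(four)");

// "three" gets \b on both sides: it matches inside the parentheses but not in "threes".
string r2 = Regex.Replace(input, BuildWholeWordPattern("three"), "3");
string r3 = Regex.Replace("threes three", BuildWholeWordPattern("three"), "3");

Console.WriteLine(r1); // one two(four) (four) four five
Console.WriteLine(r2); // one two(3) (3) four five
Console.WriteLine(r3); // threes 3
```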

Upvotes: 68

AllenG

Reputation: 8190

As Gopi said, but (theoretically) catching only (three) not two(three):

string input = "one two(three) (three) four five";

string output = input.Replace(" (three) ", " (four) ");

When I test that, I get: "one two(three) (four) four five". Just remember that whitespace is a string character too, so it can also be replaced. If I did this:

//use same input
string output = input.Replace(" ", ";");

I'd get "one;two(three);(three);four;five"

Upvotes: -1
