user1151923

Reputation: 1872

.*? does not match character before word boundary

I have a hard time understanding why ((?i)\bb.*?\b) returns b and not b- for the string a b- c. I also tried ((?i)\bb\w*\b), but that does not work any better.

Some more info:

I need to match words in a text, specifically all words that start with the letter b. Here 'words' means pretty much any character string that starts with a b, such as b, b-, b', b" etc. The 'words' I need to match are, of course, not always delimited by a space as in the example.

Upvotes: 1

Views: 108

Answers (3)

JDB

Reputation: 25810

* is called a "greedy" quantifier. It'll match as many iterations of the preceding pattern as possible. Most of the time, this is exactly what you want, but sometimes you want to use a "lazy" quantifier, meaning it'll match as few as possible, including 0.

To make a quantifier "lazy", you add a question mark: *?, +?, ??, etc.

Now, the next part of the answer is how word boundaries work. A word boundary matches a position where there's a "break" between "word characters" (0-9, a-z, A-Z and _) and "non-word characters" (or the start or end of the string). - is a non-word character, so in a b- c the position between b and - is a word boundary, as are the positions on either side of c.

Because you've got a lazy quantifier and there's a word boundary immediately after the b, that's all that your regex will match.
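To make that concrete, here is a minimal sketch in Python. The question doesn't say which regex flavor is in use, so Python's re module is an assumption here, and the inline (?i) is replaced by the re.IGNORECASE flag because recent Python versions only accept a global inline flag at the very start of the pattern. It lists where \b matches in the sample string and compares the lazy and greedy versions of the pattern.

```python
import re

text = "a b- c"

# Zero-width positions where \b matches in the sample string.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)  # [0, 1, 2, 3, 5, 6]

# Lazy: .*? consumes nothing, because \b already matches right after the b.
lazy = re.search(r"\bb.*?\b", text, re.IGNORECASE).group()
print(repr(lazy))  # 'b'

# Greedy: .* runs to the end of the string and stops at the last boundary.
greedy = re.search(r"\bb.*\b", text, re.IGNORECASE).group()
print(repr(greedy))  # 'b- c'

# The \w* attempt from the question: - is not a word character, so \w* stops before it.
word_only = re.search(r"\bb\w*\b", text, re.IGNORECASE).group()
print(repr(word_only))  # 'b'
```

Position 3 is the boundary between b and -, which is why the lazy match ends there.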

Rather than trying to use a word boundary to find the end of your word, just match word characters and dashes. A greedy character class will naturally match everything up to the "end" of the word:

\bb[-\w]*
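A quick way to check this pattern, again sketched in Python (an assumption, since the question doesn't name a flavor). The second sample string is made up for illustration; it shows that characters outside the class, such as the apostrophe in b', are not included in the match.

```python
import re

# The pattern from this answer; re.IGNORECASE stands in for the question's (?i).
pattern = re.compile(r"\bb[-\w]*", re.IGNORECASE)

first = pattern.findall("a b- c")
print(first)  # ['b-']

# Hypothetical extra sample: the apostrophe is not in [-\w], so b' yields only b.
second = pattern.findall("a b- c b' Bar-baz")
print(second)  # ['b-', 'b', 'Bar-baz']
```

If 'words' like b' or b" need to be matched too, those characters would have to be added to the class.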


Upvotes: 1

bokibeg

Reputation: 2142

This should give you the desired result:

(b.*?)(?:\s|$)

I've tested it on a b- c bfdf b32=" dfa b. b---s asd b.

It seems like you're not looking for words but any string starting with a letter "b" delimited by a space (or other?) character(s). Your original pattern can't work because "-" doesn't qualify as part of a word. Good luck.

Note: The above pattern is very simple. The final $ alternative is there so that the last "b", which sits at the end of the line with no whitespace after it, is still captured.
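For anyone who wants to reproduce the test, here is a rough sketch in Python (the flavor is an assumption, and so is reading the final period of the test sentence as punctuation rather than part of the test string). It also shows one caveat: the pattern has no anchor before the b, so it will start a match on a b in the middle of a word.

```python
import re

# The pattern from this answer, unchanged (note: case-sensitive as written).
pattern = re.compile(r"(b.*?)(?:\s|$)")

# Test string from the answer, assuming the trailing period is not part of it.
sample = 'a b- c bfdf b32=" dfa b. b---s asd b'
matches = pattern.findall(sample)
print(matches)  # ['b-', 'bfdf', 'b32="', 'b.', 'b---s', 'b']

# Caveat: nothing stops the match from starting inside a word.
mid_word = pattern.findall("abc")
print(mid_word)  # ['bc']
```

Prefixing the pattern with something like (?:^|\s) would be one way to require whitespace before the b, if that fits the data.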

Upvotes: 1

The Sidhekin

Reputation: 283

.*? is minimal, so b.*?\b finds the first word boundary after the b. Since b is a word character, and - is not, that first word boundary is between those characters.

ETA: Thing is, regexen don't consider your 'words' to be words, so \b won't work for them. You say your 'words' don't always end with a space. And, obviously, they won't end with a hyphen. How, more precisely, do they end?

Upvotes: 0
