Reputation: 581
I'm currently studying regular expression groups, and I'm having trouble fully understanding the first example the book presents on groups. The book gives the following example:
/(\S+) (\S*) ?\b(\S+)/
I understand that this will match at most three words (each consisting of any characters except whitespace), where the second word and the space after it are optional.
What I have trouble understanding is the function of the word boundary, which forces the last group to start at the beginning of a word.
When there are three words, it makes no difference whether it is included or not.
When there are only two words, group #2 and group #3 come out differently.
So, my question is as follows:
When there are two words, why is the presence of \b causing group #2 to be an empty string as expected, but when not present causes group #2 to contain the second word minus the last letter and group #3 to contain the last letter of the second word?
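The behaviour described above can be reproduced with a short script. This is only a sketch: the helper name groups and the sample inputs "the quick brown" and "the brown" are assumptions chosen for illustration, not taken from the book.

```javascript
// Hypothetical helper: returns the three captured groups, or null if there is no match.
function groups(pattern, text) {
  const m = text.match(pattern);
  return m ? m.slice(1, 4) : null;
}

const withBoundary = /(\S+) (\S*) ?\b(\S+)/;
const withoutBoundary = /(\S+) (\S*) ?(\S+)/;

// Three words: both patterns agree.
console.log(groups(withBoundary, "the quick brown"));    // [ 'the', 'quick', 'brown' ]
console.log(groups(withoutBoundary, "the quick brown")); // [ 'the', 'quick', 'brown' ]

// Two words: the patterns disagree on groups 2 and 3.
console.log(groups(withBoundary, "the brown"));    // [ 'the', '', 'brown' ]
console.log(groups(withoutBoundary, "the brown")); // [ 'the', 'brow', 'n' ]
```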
Upvotes: 4
Views: 95
Reputation: 28953
The difference comes from the second group, (\S*): it will capture any number of non-whitespace characters, including none. So when you have two words but three groups, and the last group is (\S+) (match at least one non-whitespace character), the regex engine will try to satisfy both group 2 and group 3.
Remember that it's matching a pattern, and you've not told it not to match like that. Hence group 2 gives up only as much as it has to: the second group's \S* is greedy and will initially match everything, grabbing brown. The next part of the pattern is an optional space, which passes by matching nothing. Then the engine gets to the final group, \S+, and since that needs at least one character, the second group releases characters one by one until group 3 is satisfied.
You can see this here. I've defined the third group to require at least two characters, so it gets exactly two and group 2 keeps the rest:
let [, group1, group2, group3] = "the brown".match(/(\S+) (\S*) ?(\S{2,})/);
console.log("group 1:", group1); // the
console.log("group 2:", group2); // bro
console.log("group 3:", group3); // wn
When you instead add the word boundary \b to the pattern, group 2 cannot hold any characters and still satisfy the later condition. Once a regex consumes a character, the rest of the pattern can only continue from that position onward, so you cannot have, for example, group 2 match b and then have a word boundary followed by rown, because there is no word boundary between two letters of the same word. The only way that (\S+) (\S*) ?\b(\S+) can be satisfied is the following: group 1 is the, group 2 is empty, and group 3 is brown.
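To check where group 3 is allowed to start, one can list every position at which \b matches. The following is a minimal sketch; boundaryPositions is a hypothetical helper name introduced here, not part of any library.

```javascript
// Hypothetical helper: lists every index in `text` where the zero-width \b assertion matches.
function boundaryPositions(text) {
  const positions = [];
  for (const m of text.matchAll(/\b/g)) {
    positions.push(m.index);
  }
  return positions;
}

// Boundaries sit only at the edges of "the" and "brown", never inside a word.
console.log(boundaryPositions("the brown")); // [ 0, 3, 4, 9 ]

// With \b in place, group 3 must start at index 4, so group 2 is left empty.
const boundaryMatch = "the brown".match(/(\S+) (\S*) ?\b(\S+)/);
console.log(boundaryMatch.slice(1)); // [ 'the', '', 'brown' ]
```

Since no index between 5 and 8 is a boundary, group 2 cannot end anywhere inside brown.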
Upvotes: 1
Reputation: 370639
When there are two words, why is the presence of \b causing group#2 to be an empty string as expected
Look at the first and third groups: being (\S+), they must contain characters. When there are two words, those two words must go into the first and third groups. The second group, since it's repeated with *, will not consume any characters, and will be the empty string.
but when not present causes group #2 to contain the second word minus the last letter and group #3 to contain the last letter of the second word?
When the pattern is
(\S+) (\S*) ?(\S+)
once the engine has matched the first word, it will start trying to match the second word. So if the input is foo bar, we can consider how the pattern (\S*) ?(\S+) works on bar.
The engine first tries to consume all remaining characters in the string with the \S*. This fails, because the last group is required to contain at least one character, so the engine backs up a step and has the \S* group match all but the last character. This results in a successful match, because the rest of the pattern, an optional space followed by (\S+), does match at the position before the last character.
You can see this process visually here:
https://regex101.com/r/RAkEOt/1/debugger
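The backtracking described above can also be imitated by hand. The sketch below is a simplified model and not how a real engine is implemented: simulateSecondGroup is a hypothetical helper that assumes group 1 and the first space have already been matched, takes the index where group 2 starts, and then retries the rest of the pattern after giving back one character at a time.

```javascript
// Hypothetical helper modelling (\S*) followed by the rest of the pattern.
function simulateSecondGroup(text, start, useBoundary) {
  // Remainder of the pattern after group 2. The sticky flag (y) makes it match
  // exactly at lastIndex while still seeing the whole string, so \b behaves normally.
  const rest = useBoundary ? / ?\b(\S+)/y : / ?(\S+)/y;

  // Greedy \S*: take the longest run of non-whitespace characters from `start`.
  let end = start;
  while (end < text.length && /\S/.test(text[end])) end++;

  // Backtrack: shorten group 2 by one character until the remainder matches.
  for (let stop = end; stop >= start; stop--) {
    rest.lastIndex = stop;
    const m = rest.exec(text);
    console.log(`group 2 = "${text.slice(start, stop)}": rest ${m ? "matches" : "fails"}`);
    if (m) return { group2: text.slice(start, stop), group3: m[1] };
  }
  return null;
}

// In "foo bar", group 2 starts at index 4 (right after "foo ").
const withoutB = simulateSecondGroup("foo bar", 4, false);
console.log(withoutB); // { group2: 'ba', group3: 'r' }

const withB = simulateSecondGroup("foo bar", 4, true);
console.log(withB); // { group2: '', group3: 'bar' }
```

Without \b the loop stops after a single step back; with \b it has to give back every character of bar before the remainder can match.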
In the first pattern, the word boundary before the last group ensures that the second group does not match any characters when there are only two words in the string. Rather than backtracking to just before the last character, the engine must back up all the way until a word boundary is found.
The original pattern may be slightly flawed: \b matches a word boundary, but not every non-space character is a word character. It (probably undesirably) matches foo it's in such a way that it' goes into the second group and the s goes into the third group.
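The apostrophe case can be checked directly. This is a small sketch; the variable names are chosen here for illustration only.

```javascript
const bookPattern = /(\S+) (\S*) ?\b(\S+)/;

// The apostrophe is not a word character, so a word boundary exists between ' and s.
const [, first, second, third] = "foo it's".match(bookPattern);

console.log("group 1:", first);  // foo
console.log("group 2:", second); // it'
console.log("group 3:", third);  // s
```

Backtracking stops at the first boundary it reaches from the right, which here lies inside what a reader would consider a single word.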
Upvotes: 2