Reputation: 51
I read in my textbook that "\number" refers to the nth group, but I still cannot understand how it works in this example.
import re
m = re.search(r'(\b\w+)\s+\1', 'Cherry tree blooming will begin in in later March, High Park Toronto')
print(m.group(0))
print(m.group(1))
I can sort of understand that it first returns "in in", since "in in" matches what is in the parentheses (\b\w+). But why does it return "in" for m.group(1)?
Then I modified the code a little bit to delete "\1":
import re
m = re.search(r'(\b\w+)\s+', 'Cherry tree blooming will begin in in later March, High Park Toronto')
print(m.group(0))
It returns "Cherry". I am totally lost...
Could someone please explain this in detail? Thanks.
Upvotes: 0
Views: 53
Reputation:
In your regex (\b\w+)\s+\1, what you are saying is:
Match a word (\w+) preceded by a word boundary (\b), followed by one or more whitespace characters (\s+), and then followed by the same text that was captured by (\b\w+).
Because you used a capturing group, whatever the parentheses match is stored as group 1, and \1 requires that same text to appear again. The first place this pattern occurs is in in.
m.group(0) contains the whole match, whereas m.group(1) contains the first captured group, which is in.
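To make the difference between the two groups concrete, here is a small sketch that reruns the pattern from the question. The variable name text is only for illustration, and repr() is used so that any surrounding whitespace would be visible.

```python
import re

# The sentence from the question, stored in an illustrative variable.
text = 'Cherry tree blooming will begin in in later March, High Park Toronto'

# (\b\w+) captures a word as group 1, \s+ consumes the whitespace after it,
# and \1 demands that the captured text appears again right afterwards.
m = re.search(r'(\b\w+)\s+\1', text)

print(repr(m.group(0)))  # everything the pattern matched: 'in in'
print(repr(m.group(1)))  # only what the parentheses captured: 'in'
```

Group 0 always spans the entire pattern, including the part matched by \1, while group 1 is limited to what sits inside the first pair of parentheses.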
When you remove \1, your regex becomes (\b\w+)\s+. Let's see what it's saying now:
Match a word (\w+) preceded by a word boundary (\b), followed by one or more whitespace characters (\s+).
The first occurrence of this pattern is Cherry followed by a space.
m.group(0) now holds that whole match: Cherry plus the trailing whitespace matched by \s+.
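A quick way to check this is to print both groups with repr(), which makes the trailing space visible. This is just a sketch using the sentence from the question; the variable name text is illustrative.

```python
import re

text = 'Cherry tree blooming will begin in in later March, High Park Toronto'

# Without \1 there is no "repeat" requirement, so the very first word
# that is followed by whitespace satisfies the pattern.
m = re.search(r'(\b\w+)\s+', text)

print(repr(m.group(0)))  # whole match, including the space: 'Cherry '
print(repr(m.group(1)))  # captured word only: 'Cherry'
```

The printed output of the original code looks like plain Cherry because the trailing space is invisible on screen.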
Upvotes: 1
Reputation: 11375
This is because you're matching a word (followed by whitespace), followed by the same text that was captured in group 1. Since in in is the first (and only) place where a word is immediately repeated, it matches.
For example, if you have Cherry tree tree [...], your match would be tree tree.
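The tree tree case can be tried directly. The input string below is a made-up stand-in for the elided sentence, so treat it as a hypothetical example rather than the exact text meant above.

```python
import re

# Hypothetical input: any sentence where "tree" is repeated would do.
text = 'Cherry tree tree blooming'

m = re.search(r'(\b\w+)\s+\1', text)

# "Cherry" is skipped because it is not followed by another "Cherry";
# the search moves on until it finds a word that repeats.
print(repr(m.group(0)))  # 'tree tree'
print(repr(m.group(1)))  # 'tree'
```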
Without \1, the pattern simply finds a word followed by whitespace. Since Cherry is the first word, it matches.
m.group(1)?
With re.search, m.group(0) holds the whole match, and m.group(1) holds the first capture group, which is in.
Upvotes: 1