How does the notation \number work in this example?

Question

I read in the textbook that "\number" refers to the nth group, but I still cannot understand it in this example.

import re
m = re.search(r'(\b\w+)\s+\1', 'Cherry tree blooming will begin in in later March, High Park Toronto')
print(m.group(0))
print(m.group(1))

I can sort of understand that it first returns "in in", since "in in" matches what is in the parentheses (\b\w+). But why does it return "in" for m.group(1)?

Then I modified the code a little bit to delete "\1":

import re
m = re.search(r'(\b\w+)\s+', 'Cherry tree blooming will begin in in later March, High Park Toronto')
print(m.group(0))

It returns "Cherry". I am totally lost...

Could someone please explain this in detail? Thanks.

user2705585 · Accepted Answer

In your regex (\b\w+)\s+\1, what you are saying is:

Match a word (\w+) that starts at a word boundary (\b), followed by one or more whitespace characters (\s+), followed by the same text that was captured by (\b\w+).

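Since you used a capturing group, the word matched by (\b\w+) is stored as group 1, and \1 requires that exact same text to appear again at that point. It does not re-apply the pattern \b\w+; it repeats the captured text. The first place in the string where this succeeds is in in.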
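To see that \1 repeats the captured text rather than the pattern, here is a minimal sketch in Python 3 syntax. The short sample strings and the variable names are made up for illustration and do not come from the question.

```python
import re

# \1 must match the exact text captured by group 1, so two different
# words in a row do not satisfy the pattern.
different_words = re.search(r'(\b\w+)\s+\1', 'begin in later')
print(different_words)  # None

# Repeating the sub-pattern instead of using \1 accepts any two words.
any_two_words = re.search(r'(\b\w+)\s+(\b\w+)', 'begin in later')
print(any_two_words.group(0))  # begin in

# With a doubled word, the backreference succeeds.
doubled = re.search(r'(\b\w+)\s+\1', 'begin in in later')
print(doubled.group(0))  # in in
```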

m.group(0) contains the whole match, in in, whereas m.group(1) contains only the text captured by the first group, which is in.
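The difference between the two groups can be checked directly. This is a small sketch using the string from the question, written in Python 3 syntax (the question uses Python 2 print statements); repr() is used only to make the exact matched text visible.

```python
import re

text = 'Cherry tree blooming will begin in in later March, High Park Toronto'
m = re.search(r'(\b\w+)\s+\1', text)

print(repr(m.group(0)))  # 'in in'  (everything the whole pattern matched)
print(repr(m.group(1)))  # 'in'     (only what the parentheses captured)

# The spans show that group 1 covers just the first "in" of the match.
print(text[m.start(0):m.end(0)])  # in in
print(text[m.start(1):m.end(1)])  # in
```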

When you remove \1, your regex becomes (\b\w+)\s+. Let's see what it says now:

Match a word (\w+) that starts at a word boundary (\b), followed by one or more whitespace characters (\s+).

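So the first occurrence of such a pattern is Cherry followed by a space.

m.group(0) now holds the whole match, Cherry, including the trailing space matched by \s+, which is easy to miss when printed.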
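As a quick check of the version without \1, the following sketch (again in Python 3 syntax, with the string from the question) prints the groups through repr() so the whitespace consumed by \s+ is visible.

```python
import re

text = 'Cherry tree blooming will begin in in later March, High Park Toronto'
m = re.search(r'(\b\w+)\s+', text)

print(repr(m.group(0)))  # 'Cherry '  (the word plus the whitespace matched by \s+)
print(repr(m.group(1)))  # 'Cherry'   (the captured word only)
```

Without the backreference, nothing forces the word to repeat, so re.search stops at the very first word that is followed by whitespace.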
