Amelia

Reputation: 51

How does the \number notation work in this example?

I read in my textbook that "\number" refers to the nth group, but I still cannot see how that applies in this example.

import re
m = re.search(r'(\b\w+)\s+\1', 'Cherry tree blooming will begin in in later March, High Park Toronto')
print(m.group(0))
print(m.group(1))

I can sort of understand that it first returns "in in", since each "in" matches what is in the parentheses, (\b\w+). But why does it return "in" for m.group(1)?

Then I modified the code a little bit to delete "\1":

import re
m = re.search(r'(\b\w+)\s+', 'Cherry tree blooming will begin in in later March, High Park Toronto')
print(m.group(0))

It returns "Cherry". I am totally lost...

Could someone please explain this in detail? Thanks.

Upvotes: 0

Views: 53

Answers (2)

user2705585


Your regex (\b\w+)\s+\1 says:

Match a word (\w+) that starts at a word boundary (\b), followed by one or more whitespace characters (\s+), followed by the same text that (\b\w+) captured.

Because the word is inside a capturing group, it is stored as group 1, and \1 then requires that exact text to appear again. The first place in the string where this happens is "in in".

m.group(0) contains the whole match, "in in", whereas m.group(1) contains only what the first group captured, which is "in".
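
To make the difference between the two groups concrete, here is a minimal sketch using the string from the question. The variable names text, whole and first are only illustrative; repr() is used so that any whitespace in the results is visible.

```python
import re

text = 'Cherry tree blooming will begin in in later March, High Park Toronto'

# (\b\w+) captures one word as group 1; \1 demands that same text again.
m = re.search(r'(\b\w+)\s+\1', text)

whole = m.group(0)   # everything the full pattern matched
first = m.group(1)   # only what the first (...) captured

print(repr(whole))   # 'in in'
print(repr(first))   # 'in'
```

So group 0 is always the entire match, and the numbered groups are the pieces inside it that the parentheses picked out.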

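When you remove \1, your regex becomes (\b\w+)\s+. Let's see what it says now:

Match a word (\w+) that starts at a word boundary (\b), followed by one or more whitespace characters (\s+).

Nothing has to be repeated any more, so the first place this pattern matches is the very first word, "Cherry".

m.group(0) now holds the whole match, "Cherry " (including the trailing space consumed by \s+), and m.group(1) holds "Cherry".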
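
As a quick check of the pattern without the backreference, this sketch runs it on the string from the question. The names text, whole and first are just for illustration; repr() makes the trailing space visible, which a plain print would hide.

```python
import re

text = 'Cherry tree blooming will begin in in later March, High Park Toronto'

# Without \1 there is no "same word again" requirement,
# so the first word followed by whitespace already matches.
m = re.search(r'(\b\w+)\s+', text)

whole = m.group(0)   # the word plus the whitespace matched by \s+
first = m.group(1)   # just the captured word

print(repr(whole))   # 'Cherry '
print(repr(first))   # 'Cherry'
```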

Upvotes: 1

ʰᵈˑ

Reputation: 11375

(\b\w+)\s+\1

This matches a word followed by whitespace, followed by the same text that group 1 captured. Since "in in" is the first (and only) place where a word is immediately repeated, that is what matches.

For example, if you have Cherry tree tree [...], your match would be tree tree.
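
Here is a small sketch of that case. The sample string is made up for illustration (the question's sentence with an extra "tree" added), and the names pattern, first_match and doubled are hypothetical. re.search returns only the first match, while re.finditer walks through all of them.

```python
import re

# Hypothetical input: the original sentence with "tree" doubled as well.
text = 'Cherry tree tree blooming will begin in in later March'

pattern = re.compile(r'(\b\w+)\s+\1')

# search() stops at the first doubled word it finds.
first_match = pattern.search(text).group(0)

# finditer() yields every non-overlapping match, left to right.
doubled = [m.group(0) for m in pattern.finditer(text)]

print(first_match)   # tree tree
print(doubled)       # ['tree tree', 'in in']
```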

(\b\w+)\s+

This simply finds a word followed by whitespace. Since "Cherry" is the first word, it matches.

But why it returns "in" for m.group(1)?

With re.search, m.group(0) holds the whole match, and m.group(1) holds the first capture group, which is "in".

Upvotes: 1
