Regular expression in Python

Question

When parentheses are used in the program below, the output is ['www.google.com'].

import re
teststring = "href=\"www.google.com\""
m = re.findall('href="(.*?)"', teststring)
print(m)

If the parentheses are removed from the pattern passed to findall, the output is ['href="www.google.com"'].

import re
teststring = "href=\"www.google.com\""
m = re.findall('href=".*?"', teststring)
print(m)

It would be helpful if someone could explain how this works.

Martijn Pieters · Accepted Answer

The re.findall() documentation is quite clear on the difference:

Return all non-overlapping matches of pattern in string, as a list of strings. […] If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

So .findall() returns a list containing one of three types of values, depending on the number of groups in the pattern:

0 capturing groups in the pattern (no (...) parentheses): return the whole matched string ('href="www.google.com"' in your second example).
1 capturing group in the pattern: return only the captured group ('www.google.com' in your first example).
More than 1 capturing group in the pattern: return a tuple of all captured groups.
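
The three cases can be compared side by side in a short sketch. It reuses the teststring from the question; the variable names and the two-group pattern are only illustrative.

```python
import re

teststring = 'href="www.google.com"'

# No capturing group: each list item is the whole match.
no_group = re.findall(r'href=".*?"', teststring)

# One capturing group: each list item is just that group's text.
one_group = re.findall(r'href="(.*?)"', teststring)

# Two capturing groups: each list item is a tuple, one entry per group.
two_groups = re.findall(r'(href)="(.*?)"', teststring)

print(no_group)    # ['href="www.google.com"']
print(one_group)   # ['www.google.com']
print(two_groups)  # [('href', 'www.google.com')]
```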

Use non-capturing groups ((?:...)) if you don't want that behaviour, or add groups if you want more information. For example, adding a group around the href= part would result in a list of tuples with two elements each:

>>> re.findall('(href=)"(.*?)"', teststring)
[('href=', 'www.google.com')]
