Vindhya G

Reputation: 1369

Regular expression in Python

With the parentheses in the pattern, the program below prints ['www.google.com'].

import re
teststring = "href=\"www.google.com\""
m = re.findall('href="(.*?)"', teststring)  # pattern with one capturing group
print(m)

With the parentheses removed from the findall pattern, the output is ['href="www.google.com"'].

import re
teststring = "href=\"www.google.com\""
m = re.findall('href=".*?"', teststring)  # same pattern without the group
print(m)

It would be helpful if someone could explain how this works.

Upvotes: 2

Views: 80

Answers (1)

Martijn Pieters

Reputation: 1121346

The re.findall() documentation is quite clear on the difference:

Return all non-overlapping matches of pattern in string, as a list of strings. […] If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

So .findall() returns a list containing one of three types of values, depending on the number of groups in the pattern:

  • 0 capturing groups in the pattern (no (...) parentheses): each item is the whole matched string ('href="www.google.com"' in your second example).
  • 1 capturing group in the pattern: each item is the text captured by that group ('www.google.com' in your first example).
  • more than 1 capturing group in the pattern: each item is a tuple of all captured groups.
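
The three cases can be checked side by side. This is a minimal sketch that reuses the teststring from the question; the variable names no_group, one_group and two_groups are only there for illustration.

```python
import re

teststring = 'href="www.google.com"'

# No capturing group: each result is the whole match.
no_group = re.findall(r'href=".*?"', teststring)

# One capturing group: each result is just that group's text.
one_group = re.findall(r'href="(.*?)"', teststring)

# Two capturing groups: each result is a tuple with one item per group.
two_groups = re.findall(r'(href=)"(.*?)"', teststring)

print(no_group)    # ['href="www.google.com"']
print(one_group)   # ['www.google.com']
print(two_groups)  # [('href=', 'www.google.com')]
```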

Use non-capturing groups ((?:...)) if you don't want that behaviour, or add groups if you want more information. For example, adding a group around the href= part would result in a list of tuples with two elements each:

>>> re.findall('(href=)"(.*?)"', teststring)
[('href=', 'www.google.com')]
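
A non-capturing group is useful when parentheses are needed only for grouping, for instance around an alternation. The sketch below assumes a made-up input string with an extra src attribute (not part of the question) to show the difference between (?:...) and a plain (...) group.

```python
import re

# Hypothetical input with two attributes, for illustration only.
html = 'href="www.google.com" src="logo.png"'

# (?:...) groups the alternation without capturing it, so the quoted
# value is the only capturing group and findall returns plain strings.
values = re.findall(r'(?:href|src)="(.*?)"', html)

# With a capturing group around the alternation there are two groups,
# so findall returns a list of tuples instead.
pairs = re.findall(r'(href|src)="(.*?)"', html)

print(values)  # ['www.google.com', 'logo.png']
print(pairs)   # [('href', 'www.google.com'), ('src', 'logo.png')]
```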

Upvotes: 5
