Reputation: 319
I need a little script to filter literature references from a given text. The references can be in two formats:
bla bla bla (Snowden, 2014a) bla bla bla (Bush and Blair, 2005) bla bla bla.
Finding these references with two searches works fine:
matches1 = re.findall('\([A-Z]\w*,\s?\d\d\d\d[a-z]?\)', line)
matches2 = re.findall('\([A-Z]\w* and [A-Z]\w*,\s?\d\d\d\d[a-z]?\)', line)
These searches correctly find (Snowden, 2014a) and (Bush and Blair, 2005), respectively. But now I want to find both kinds of references in one search, and it fails:
matches1 = re.findall('\([A-Z]\w*( and [A-Z]\w*)?,\s?\d\d\d\d[a-z]?\)', line)
This search returns '' instead of (Snowden, 2014a) and ' and Blair' instead of (Bush and Blair, 2005). It's not clear to me why this happens or what I've done wrong, so any help is appreciated :)
Thanks!
Upvotes: 0
Views: 56
Reputation: 174696
Just turn the capturing group into a non-capturing group and reduce \d\d\d\d to \d{4}. This is needed because re.findall gives preference to groups: if the pattern contains any capturing groups, it returns only the text matched by those groups and discards the rest of the match.
\([A-Z]\w*(?: and [A-Z]\w*)?,\s?\d{4}[a-z]?\)
Sample Code:
>>> import re
>>> s = """foo bar (Snowden, 2014a)
... (Bush and Blair, 2005) foo bar"""
>>> m = re.findall(r'\([A-Z]\w*(?: and [A-Z]\w*)?,\s?\d{4}[a-z]?\)', s, re.M)
>>> for i in m:
...     print(i)
...
(Snowden, 2014a)
(Bush and Blair, 2005)
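To make the difference concrete, here is a small sketch that runs both versions of the pattern on the sample sentence from the question. The variable names (`capturing`, `non_capturing`, `with_group`, `without_group`) are made up for illustration.

```python
import re

line = "bla bla bla (Snowden, 2014a) bla bla bla (Bush and Blair, 2005) bla bla bla."

# Same pattern twice; only the optional group differs: (...) vs (?:...)
capturing = r'\([A-Z]\w*( and [A-Z]\w*)?,\s?\d{4}[a-z]?\)'
non_capturing = r'\([A-Z]\w*(?: and [A-Z]\w*)?,\s?\d{4}[a-z]?\)'

# With a capturing group, findall returns the group's text for each match
with_group = re.findall(capturing, line)

# Without capturing groups, findall returns each whole match
without_group = re.findall(non_capturing, line)

print(with_group)     # ['', ' and Blair']
print(without_group)  # ['(Snowden, 2014a)', '(Bush and Blair, 2005)']
```

The empty string in the first result is the optional group not taking part in the (Snowden, 2014a) match.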
Upvotes: 2
Reputation: 82889
Make your optional group non-capturing by adding ?: at the start of the group:
In [8]: re.findall('\([A-Z]\w*(?: and [A-Z]\w*)?,\s?\d\d\d\d[a-z]?\)', line)
Out[8]: ['(Snowden, 2014a)', '(Bush and Blair, 2005)']
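If you would rather keep the capturing group (for example, to pull out the second author as well), one possible alternative is re.finditer, which yields match objects, so the whole match is still available through group(0). This is only a sketch; the names `pattern`, `full_matches` and `second_authors` are assumptions for illustration.

```python
import re

line = "bla bla bla (Snowden, 2014a) bla bla bla (Bush and Blair, 2005) bla bla bla."

# The original pattern from the question, capturing group left in place
pattern = re.compile(r'\([A-Z]\w*( and [A-Z]\w*)?,\s?\d{4}[a-z]?\)')

# group(0) is always the entire match, regardless of capturing groups
full_matches = [m.group(0) for m in pattern.finditer(line)]

# group(1) is the optional group; it is None when the group did not take part
second_authors = [m.group(1) for m in pattern.finditer(line)]

print(full_matches)    # ['(Snowden, 2014a)', '(Bush and Blair, 2005)']
print(second_authors)  # [None, ' and Blair']
```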
Upvotes: 0