Reputation: 2871
I'm having trouble converting a Perl regex to Python. The text I'm trying to match has the following pattern:
Author(s) : Firstname Lastname
    Firstname Lastname
    Firstname Lastname
    Firstname Lastname
In Perl I was able to match this and extract the authors with
/Author\(s\) :((.+\n)+?)/
When I try
re.compile(r'Author\(s\) :((.+\n)+?)')
in Python, it matches the first author twice and ignores the rest.
Can anyone explain what I am doing wrong here?
Upvotes: 4
Views: 1554
Reputation: 22415
You can do this:
# find lines with authors
import re
# multiline string to simulate possible input
text = '''
Stuff before
This won't be matched...
Author(s) : Firstname Lastname
    Firstname Lastname
    Firstname Lastname
    Firstname Lastname
Other(s) : Something else we won't match
More shenanigans....
Only the author names will be matched.
'''
# run the regex to pull author lines from the sample input
authors = re.search(r'Author\(s\)\s*:\s*(.*?)^[^\s]', text, re.DOTALL | re.MULTILINE).group(1)
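The regex above matches the label (Author(s), optional whitespace, a colon, optional whitespace) and then lazily captures everything up to the next line that begins with a non-whitespace character. In other words, it collects the rest of the Author(s) line plus every following line that begins with whitespace, which gives you the result below: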
'''Firstname Lastname
    Firstname Lastname
    Firstname Lastname
    Firstname Lastname
'''
You can then use the regex below to pull the individual authors out of those lines, stripping the surrounding whitespace from each one:
# grab authors from the lines
import re
authors = '''Firstname Lastname
    Firstname Lastname
    Firstname Lastname
    Firstname Lastname
'''
# run the regex to pull a list of individual authors from the author lines
authors = re.findall(r'^\s*(.+?)\s*$', authors, re.MULTILINE)
Which gives you the list of authors:
['Firstname Lastname', 'Firstname Lastname', 'Firstname Lastname', 'Firstname Lastname']
Combined example code:
text = '''
Stuff before
This won't be matched...
Author(s) : Firstname Lastname
    Firstname Lastname
    Firstname Lastname
    Firstname Lastname
Other(s) : Something else we won't match
More shenanigans....
Only the author names will be matched.
'''
import re
stage1 = re.compile(r'Author\(s\)\s*:\s*(.*?)^[^\s]', re.DOTALL | re.MULTILINE)
stage2 = re.compile(r'^\s*(.+?)\s*$', re.MULTILINE)  # raw string, so \s is not treated as a string escape
preliminary = stage1.search(text).group(1)
authors = stage2.findall(preliminary)
Which sets authors to:
['Firstname Lastname', 'Firstname Lastname', 'Firstname Lastname', 'Firstname Lastname']
Success!
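One caveat: the combined example calls .group(1) directly on the search result, which raises AttributeError when there is no Author(s) block, and the stage-one pattern needs a non-indented line after the block in order to match at all. Here is a minimal sketch of a more defensive variant. The function name extract_authors and the lookahead (?=^\S|\Z) are my own additions for illustration, not part of the code above, and the sample names are made up.

```python
import re

# Stage 1: capture everything after the label up to the next line that
# starts with a non-whitespace character, or up to the end of the text.
# The lookahead (?=^\S|\Z) is an assumed variation on the original pattern.
STAGE1 = re.compile(r'Author\(s\)\s*:\s*(.*?)(?=^\S|\Z)', re.DOTALL | re.MULTILINE)

# Stage 2: one author per line, with surrounding whitespace stripped.
STAGE2 = re.compile(r'^\s*(.+?)\s*$', re.MULTILINE)


def extract_authors(text):
    """Return the list of author names, or an empty list if no block is found."""
    match = STAGE1.search(text)
    if match is None:
        return []
    return STAGE2.findall(match.group(1))


# Hypothetical sample inputs.
followed = "Author(s) : Alice Smith\n    Bob Jones\nOther(s) : something else\n"
at_end = "Author(s) : Alice Smith\n    Bob Jones\n"

print(extract_authors(followed))
print(extract_authors(at_end))
print(extract_authors("no author block here"))
```

Returning an empty list for the no-match case keeps the caller free of None checks; raise an exception instead if a missing block should count as an error in your data.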
Upvotes: 3
Reputation: 127467
Try
re.compile(r'Author\(s\) :((.+\n)+)')
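In your original expression, the +? makes the repetition non-greedy, i.e. minimal, so it stops after the first line. Both groups then hold that same line, which is likely why you see the first author twice. With a plain +, the match keeps consuming lines until it reaches an empty line or the end of the text.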
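A small sketch comparing the two quantifiers side by side. The sample text, including the made-up names and the blank line that ends the author block, is an assumption about what the input looks like.

```python
import re

# Hypothetical input: an author block ended by an empty line.
text = (
    "Author(s) : Alice Smith\n"
    "    Bob Jones\n"
    "    Carol White\n"
    "\n"
    "Abstract : ...\n"
)

lazy = re.search(r'Author\(s\) :((.+\n)+?)', text)
greedy = re.search(r'Author\(s\) :((.+\n)+)', text)

# Lazy: one repetition is enough, so both groups hold the first line only.
print(repr(lazy.group(1)))
print(repr(lazy.group(2)))

# Greedy: the outer group spans every line up to the empty one.
print(repr(greedy.group(1)))
```

Note that .+ cannot match an empty line, so the blank line is what stops the greedy version here.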
Upvotes: 1
Reputation: 387785
A capturing group only retains a single match. Even if the group sits inside a repetition and matches several times, you can only access the text from its last iteration. You'll have to match all the names at once and then split them (on newlines, or with a second regex).
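To illustrate, here is a short sketch using the greedy pattern from this thread on made-up sample text (the names and layout are assumptions). The inner group runs once per line but keeps only its final iteration, while the outer group holds the whole block and can be split.

```python
import re

# Hypothetical input: an author block ended by an empty line.
text = "Author(s) : Alice Smith\n    Bob Jones\n    Carol White\n\nAbstract\n"

m = re.search(r'Author\(s\) :((.+\n)+)', text)

# The inner group matched three times, but only the last iteration is kept.
last_only = m.group(2)

# The outer group spans the whole block, so split it to get every name.
names = [line.strip() for line in m.group(1).splitlines()]

print(repr(last_only))
print(names)
```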
Upvotes: 2