Eric Seidel

Reputation: 2871

Converting Perl Regular Expressions to Python Regular Expressions

I'm having trouble converting a Perl regex to Python. The text I'm trying to match has the following pattern:

Author(s)    : Firstname Lastname  
               Firstname Lastname  
               Firstname Lastname  
               Firstname Lastname

In Perl I was able to match this and extract the authors with

/Author\(s\)    :((.+\n)+?)/

When I try

re.compile(r'Author\(s\)    :((.+\n)+?)')

in Python, it matches the first author twice and ignores the rest.

Can anyone explain what I am doing wrong here?

Upvotes: 4

Views: 1554

Answers (3)

lunixbochs

Reputation: 22415

You can do this:

# find lines with authors
import re

# multiline string to simulate possible input
text = '''
Stuff before
This won't be matched...
Author(s)    : Firstname Lastname  
               Firstname Lastname  
               Firstname Lastname  
               Firstname Lastname
Other(s)     : Something else we won't match
               More shenanigans....
Only the author names will be matched.
'''

# run the regex to pull author lines from the sample input
authors = re.search(r'Author\(s\)\s*:\s*(.*?)^[^\s]', text, re.DOTALL | re.MULTILINE).group(1)

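The regex above matches the header (Author(s), optional whitespace, a colon, optional whitespace) and then lazily captures everything up to the first line that begins with a non-whitespace character. Since the continuation lines are all indented, the capture is exactly the author lines: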

'''Firstname Lastname  
           Firstname Lastname  
           Firstname Lastname  
           Firstname Lastname
'''

You can then use the regex below to split those results into individual authors:

# grab authors from the lines
import re
authors = '''Firstname Lastname  
           Firstname Lastname  
           Firstname Lastname  
           Firstname Lastname
'''

# run the regex to pull a list of individual authors from the author lines
authors = re.findall(r'^\s*(.+?)\s*$', authors, re.MULTILINE)

Which gives you the list of authors:

['Firstname Lastname', 'Firstname Lastname', 'Firstname Lastname', 'Firstname Lastname']

Combined example code:

text = '''
Stuff before
This won't be matched...
Author(s)    : Firstname Lastname  
               Firstname Lastname  
               Firstname Lastname  
               Firstname Lastname
Other(s)     : Something else we won't match
               More shenanigans....
Only the author names will be matched.
'''

import re
stage1 = re.compile(r'Author\(s\)\s*:\s*(.*?)^[^\s]', re.DOTALL | re.MULTILINE)
stage2 = re.compile(r'^\s*(.+?)\s*$', re.MULTILINE)

preliminary = stage1.search(text).group(1)
authors = stage2.findall(preliminary)

Which sets authors to:

['Firstname Lastname', 'Firstname Lastname', 'Firstname Lastname', 'Firstname Lastname']

Success!
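
One caveat with the combined code: stage1.search(text) returns None when there is no Author(s) header, or when the author block is the last thing in the text with no unindented line after it, and then .group(1) raises AttributeError. Here is a minimal sketch of a guarded version, assuming you want an empty list when nothing matches. The extract_authors name and the (?:^\S|\Z) ending are my own additions for illustration, not part of the code above, and the sample names are made up.

```python
import re

# Same two-stage idea; the block may now also end at the end of the text (\Z).
STAGE1 = re.compile(r'Author\(s\)\s*:\s*(.*?)(?:^\S|\Z)', re.DOTALL | re.MULTILINE)
STAGE2 = re.compile(r'^\s*(.+?)\s*$', re.MULTILINE)


def extract_authors(text):
    """Return the list of author names, or [] if no Author(s) block is found."""
    match = STAGE1.search(text)
    if match is None:
        return []
    return STAGE2.findall(match.group(1))


# Author block at the very end of the input, with nothing after it
sample = (
    "Author(s)    : Alice Smith  \n"
    "               Bob Jones\n"
)
print(extract_authors(sample))
print(extract_authors("no header here"))
```

Whether an empty list or an exception is the better signal for "no authors found" depends on how you use the result.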

Upvotes: 3

Martin v. Löwis

Reputation: 127467

Try

re.compile(r'Author\(s\)    :((.+\n)+)')

In your original expression, the +? makes the repetition non-greedy, i.e. minimal: it matches as few lines as it can, so it stops after the first author line.
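
To see the difference side by side, here is a small sketch with made-up sample text (the names and the Title line are placeholders, and I'm assuming the author block is followed by a blank line):

```python
import re

# Made-up sample: three author lines, then a blank line ends the block.
text = (
    "Author(s)    : Alice Smith\n"
    "               Bob Jones\n"
    "               Carol White\n"
    "\n"
    "Title        : Example\n"
)

lazy = re.search(r'Author\(s\)    :((.+\n)+?)', text)
greedy = re.search(r'Author\(s\)    :((.+\n)+)', text)

# +? stops after one repetition, so only the first line is captured
print(repr(lazy.group(1)))
# + keeps consuming non-empty lines until it reaches the blank line
print(repr(greedy.group(1)))
```

Note that the greedy version only stops at an empty line or at the end of the text, so it relies on the author block being followed by a blank line. If other non-empty lines come right after the authors, they get captured too.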

Upvotes: 1

poke

Reputation: 387785

A capture group holds only one value. So even if the group is repeated, you can only access its last match. You'll have to match all the names at once and then split them (on newlines, or with another regex).
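
A short sketch of what that looks like in practice, using made-up names and assuming the block ends at a blank line:

```python
import re

# Made-up sample input; the trailing blank line ends the author block.
text = (
    "Author(s)    : Alice Smith  \n"
    "               Bob Jones  \n"
    "               Carol White\n"
    "\n"
)

match = re.search(r'Author\(s\)    :((.+\n)+)', text)

# The inner group was repeated three times but keeps only its last iteration.
last_line = match.group(2)
print(repr(last_line))

# The outer group spans every repetition, so split that one instead.
authors = [line.strip() for line in match.group(1).splitlines()]
print(authors)
```

Splitting on newlines and stripping each line is usually enough here; a second regex is only needed if the lines carry more structure than a name.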

Upvotes: 2
