Reputation: 160
I'm having a bit of an issue with my regex and hopefully somebody can help me out.
Basically, I have a regex that I pass to re.findall() in a Python script. My goal is to search strings of varying length for references to Bible verses (e.g. John 3:16, Romans 6, etc.). My regex mostly works, but it sometimes tacks on an extra whitespace character before the Bible book name. Here's the call:
versesToFind = re.findall(r'\d?\s?\w+\s\d+:?\d*', str)
To explain the problem better, here are my results when running it on this text string:
str = 'testing testing John 3:16 adsfbaf John 2 1 Kings 4 Romans 4'
Result (from www.pythonregex.com):
[u' John 3:16', u' John 2', u'1 Kings 4', u' Romans 4']
As you can see, John 3:16, John 2, and Romans 4 each have an extra leading space that I want to get rid of. Hopefully my explanation makes sense. Thanks in advance!
Upvotes: 0
Views: 1278
Reputation: 59974
Instead of rewriting your regular expression, you can always just strip() the whitespace:
>>> L = [u' John 3:16', u' John 2', u'1 Kings 4', u' Romans 4']
>>> print map(unicode.strip, L)
[u'John 3:16', u'John 2', u'1 Kings 4', u'Romans 4']
map() here is equivalent to:
>>> print [i.strip() for i in L]
[u'John 3:16', u'John 2', u'1 Kings 4', u'Romans 4']
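The session above is Python 2 (print statement, unicode type). As a rough sketch, assuming Python 3, where str is the text type and map() returns a lazy iterator, the same cleanup might look like this (the variable names are just illustrative):

```python
# Assumes Python 3; the sample list mirrors the matches from the question.
matches = [' John 3:16', ' John 2', '1 Kings 4', ' Romans 4']

# map() returns an iterator in Python 3, so wrap it in list() to get the values.
stripped_with_map = list(map(str.strip, matches))

# Equivalent list comprehension.
stripped_with_comprehension = [m.strip() for m in matches]

print(stripped_with_map)
```

Both forms give the same list, so which one to use is mostly a matter of taste.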
Upvotes: 0
Reputation: 2078
Using a list comprehension, you can do it in a single line:
versesToFind = [x.strip() for x in re.findall(r'\d?\s?\w+\s\d+:?\d*', str)]
Upvotes: 0
Reputation: 26397
You can make the digit and space optional as a single unit by grouping them with parentheses (the ?: just marks the group as non-capturing):
'(?:\d\s)?\w+\s\d+:?\d*'
 ^^^    ^
This produces:
>>> s = 'testing testing John 3:16 adsfbaf John 2 1 Kings 4 Romans 4'
>>> re.findall(r'(?:\d\s)?\w+\s\d+:?\d*', s)
['John 3:16', 'John 2', '1 Kings 4', 'Romans 4']
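To see why the grouping matters, here is a small side-by-side sketch (assuming Python 3; the names text, original and grouped are only for illustration). In the original pattern, \d? and \s? are optional independently, so \s? can match the space before a book name even when there is no leading digit:

```python
import re

text = 'testing testing John 3:16 adsfbaf John 2 1 Kings 4 Romans 4'

# Original: \d? and \s? are each optional on their own, so \s? alone can
# consume the space in front of a book name that has no leading digit.
original = re.findall(r'\d?\s?\w+\s\d+:?\d*', text)

# Grouped: the space is only allowed directly after a leading digit.
grouped = re.findall(r'(?:\d\s)?\w+\s\d+:?\d*', text)

print(original)  # [' John 3:16', ' John 2', '1 Kings 4', ' Romans 4']
print(grouped)   # ['John 3:16', 'John 2', '1 Kings 4', 'Romans 4']
```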
Upvotes: 1