Reputation: 614
Below I have text from which I want to extract the month (July in this case).
The word_pattern makes sure that the text contains the required words, while the month_pattern extracts the month. So first I verify that the text passage contains certain words, and if it does, I then attempt to extract the month.
When the patterns are used separately, they get a match, but if I try to combine them I end up with no matches. I can't figure out what I'm doing wrong.
import re
text = ''' The number of shares of the
registrant’s common stock outstanding as
of July 31, 2017 was 52,833,429.'''
# patterns
word_pattern = r'(?=.*outstanding[.,]?)(?=.*common)(?=.*shares)'
month_pattern = r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)'
pattern = word_pattern + month_pattern
print(re.search(pattern, text, flags = re.IGNORECASE|re.DOTALL))
Expected result:
<re.Match object; span=(73, 77), match='July'>
Upvotes: 2
Views: 84
Reputation: 18950
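Regex patterns can't simply be concatenated like that. The issue is that your word pattern consists only of lookaheads, which are zero-width: they check what follows but do not move the match position ahead. The month pattern therefore has to match at the exact position where the lookaheads were checked, and at the position of "July" the lookaheads fail because the required words are already behind it. So you need to allow the cursor to advance to the month position using a quantifier that bridges the gap, e.g. .*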
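To see the zero-width behaviour concretely, here is a minimal sketch. The shortened text and the reduced month pattern are my own simplifications for illustration, not the asker's exact data.

```python
import re

# Simplified, hypothetical sample text (shorter than the one in the question).
text = "common shares outstanding as of July 31, 2017"

lookaheads = r'(?=.*outstanding)(?=.*common)(?=.*shares)'
month = r'(Jul(?:y)?)'  # reduced to a single month to keep the example short
flags = re.IGNORECASE | re.DOTALL

# Lookaheads alone succeed at position 0 but consume nothing.
only_lookaheads = re.search(lookaheads, text, flags=flags)
print(only_lookaheads.span())  # (0, 0)

# Concatenated: the month must start exactly where the lookaheads are checked.
# At the position of "July" the required words are behind the cursor,
# so the lookaheads fail there and no position in the text matches.
concatenated = re.search(lookaheads + month, text, flags=flags)
print(concatenated)  # None
```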
Try
(?=.*outstanding[.,]?)(?=.*common)(?=.*shares).*(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)
Or, using your existing variables, pattern = word_pattern + '.*' + month_pattern
should do the trick.
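Because the overall match now starts where the lookaheads succeed (the beginning of the text) and runs up to the month, the month itself is found in capture group 1: re.search(...).group(1)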
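Putting it together with the text and patterns from the question, a sketch could look like the following. The second half is an assumption-based aside: the extra sentence appended to the text is made up to show that a greedy .* picks the last month in the passage, while a lazy .*? picks the first.

```python
import re

text = ''' The number of shares of the
registrant’s common stock outstanding as
of July 31, 2017 was 52,833,429.'''

word_pattern = r'(?=.*outstanding[.,]?)(?=.*common)(?=.*shares)'
month_pattern = (r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?'
                 r'|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)')
flags = re.IGNORECASE | re.DOTALL

# '.*' lets the cursor travel from the start of the text to the month.
greedy = re.search(word_pattern + '.*' + month_pattern, text, flags=flags)
print(greedy.group(1))  # July

# Hypothetical text with a second month, to compare greedy and lazy bridging.
two_months = text + ' This report was filed in August.'
lazy = re.search(word_pattern + '.*?' + month_pattern, two_months, flags=flags)
greedy_last = re.search(word_pattern + '.*' + month_pattern, two_months, flags=flags)
print(lazy.group(1), greedy_last.group(1))  # July August
```

If the passage can contain more than one month, the lazy variant is probably the one you want. Also note that, with IGNORECASE, short alternatives such as Mar or May can match inside ordinary words, so adding word boundaries (\b) around the month pattern may be worth considering.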
Upvotes: 2