Reputation: 3433
I am using regex to catch certain patterns in a text. Then I would like to get the words BEFORE each pattern, back to the first stopword. How can I proceed?
import re
string = 'bla bla, the document nr is 10101 where the other doc nr is 34454 with...'
rex = r'\b\d{1,5}\b'
stopwords = ['a', 'an', 'the']
print([match.group(0) for match in list(re.finditer(rex,string))])
This extracts the numbers: ['10101', '34454']
Now what I want is two groups for every match. This is the expected output:
Match 1 (the document nr is) (10101)
Match 2 (the other doc nr is) (34454)
So the numbers themselves and all the words preceding each number, back to a stopword from the list.
I started with this code:
rex_sw = r'\b' + r'\b|\b'.join(stopwords) + r'\b'
print(rex_sw)
any_word = rf"\b.*\b"
rex_total = rf"({rex_sw})({any_word})({rex})"
print(rex_total)
print([match.group(0) for match in list(re.finditer(rex_total,string))])
But of course this does not work, since the match starts at the first stopword and runs all the way to the last number.
So the pseudo regex is: get the number, go back until you find a stopword (max 10 words), and return both groups in a match.
How to go about this?
Upvotes: 1
Views: 103
Reputation: 67988
(\b(?:a|an|the)\b(?:(?!\b(?:a|an|the)\b).)*?)(\b\d{1,5}\b)
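You can try something like this. The negative lookahead in front of the dot stops the match from running across another stopword, so group 1 begins at the closest stopword before the number (and includes the space before the number).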
See demo.
https://regex101.com/r/4selF0/1
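To connect the pattern back to the question's code, here is a minimal sketch that builds the same regex from the stopwords list and prints both groups. The names text, sw, number, pattern and capped are only illustrative. The second pattern is an assumption on my side, not part of the regex above: it counts whole words instead of characters so that the "max 10 words" limit from the question can be expressed with a bounded quantifier.

```python
import re

text = 'bla bla, the document nr is 10101 where the other doc nr is 34454 with...'
stopwords = ['a', 'an', 'the']

# Alternation of the (escaped) stopwords: a|an|the
sw = '|'.join(map(re.escape, stopwords))
number = r'\b\d{1,5}\b'

# Group 1: a stopword, then lazily any characters that do not start another stopword.
# Group 2: the number itself.
pattern = re.compile(rf'(\b(?:{sw})\b(?:(?!\b(?:{sw})\b).)*?)({number})')

# strip() drops the trailing space that group 1 picks up before the number.
matches = [(m.group(1).strip(), m.group(2)) for m in pattern.finditer(text)]
print(matches)
# [('the document nr is', '10101'), ('the other doc nr is', '34454')]

# Hypothetical variant: allow at most 10 non-stopword words between the
# stopword and the number. The doubled braces are needed inside an f-string.
capped = re.compile(
    rf'(\b(?:{sw})\b(?:\W+(?!(?:{sw})\b)\w+){{0,10}}?)\W+({number})'
)
capped_matches = [(m.group(1), m.group(2)) for m in capped.finditer(text)]
print(capped_matches)
```

With the capped variant, a number that sits more than 10 words after the nearest stopword is simply not matched, which may or may not be what you want.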
Upvotes: 1