python regex catch pattern and previous words up to a stop word

Question

I am using a regex to catch certain patterns in a text. Then I would like to get the words BEFORE each pattern, going back to the nearest stop word. How can I proceed?

import re
string = 'bla bla, the document nr is 10101 where the other doc nr is 34454 with...'
rex = r'\b\d{1,5}\b'
stopwords = ['a', 'an', 'the']
print([match.group(0) for match in re.finditer(rex, string)])

This extracts the numbers: ['10101', '34454']

Now I want two groups for every match. This is the expected output:

Match 1 (the document nr is) (10101)

Match 2 (the other doc nr is) (34454)

So I want the numbers themselves and all the words preceding each number, back to the nearest stop word from the list.

I started with this code:

rex_sw = r'\b' + r'\b|\b'.join(stopwords) + r'\b'
print(rex_sw)
any_word = rf"\b.*\b"
rex_total = rf"({rex_sw})({any_word})({rex})"
print(rex_total)
print([match.group(0) for match in list(re.finditer(rex_total,string))])

But this does not work, of course, since the match starts at the first stop word and runs all the way to the last number.

So the pseudo-regex is: get the number, go back until you find a stop word (at most 10 words), and return both groups in one match.

How to go about this?

Answers (1)
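
One possible approach, offered as a sketch rather than a definitive answer: anchor the match at a stop word, then lazily allow up to 10 following words, none of which may itself be a stop word, before the number. Because no stop word can appear between the anchor and the number, the match always starts at the nearest stop word. The names string, rex and stopwords come from the question; sw, MAX_WORDS, rex_total and pairs are hypothetical names introduced here, and treating a "word" as a run of \w characters separated by \W characters is an assumption.

```python
import re

string = 'bla bla, the document nr is 10101 where the other doc nr is 34454 with...'
rex = r'\b\d{1,5}\b'
stopwords = ['a', 'an', 'the']
MAX_WORDS = 10  # "max 10 words" from the question (hypothetical name)

# Alternation of the stop words, escaped in case one contains a regex metacharacter.
sw = '|'.join(map(re.escape, stopwords))

rex_total = re.compile(
    rf"\b((?:{sw})\b"                                 # group 1 starts at a stop word
    rf"(?:\W+(?!(?:{sw})\b)\w+){{0,{MAX_WORDS}}}?)"   # then up to 10 non-stop words, lazily
    rf"\W+({rex})"                                    # group 2: the number itself
)

pairs = [(m.group(1), m.group(2)) for m in rex_total.finditer(string)]
print(pairs)
# [('the document nr is', '10101'), ('the other doc nr is', '34454')]
```

The negative lookahead (?!(?:a|an|the)\b) is what stops the match from running across an earlier stop word, and the lazy quantifier {0,10}? makes it stop at the first number it can reach. If stop words may be capitalised in the text, passing re.IGNORECASE to re.compile is one option.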
