Reputation: 1445
I have to extract two words before and after my substring match in a large string. For example:
sub = 'name'
str = '''My name is Avi. Name identifies who you are. It is important to have a name starting with the letter A.'''
Now I have to find all occurrences of sub in str and then return the following:
(My name is Avi), (Name identifies who), (have a name starting with)
Note that if there is a full stop after the string, then only the words before the string are returned, as shown in the example above.
What I have tried:
>>> import re
>>> text = '''My name is Avi. Name identifies who you are. It is important to have a name starting with the letter A.'''
>>> for m in re.finditer('name', text):
...     print('name found', m.start(), m.end())
This gives me the start and end positions of each matched substring, but I don't know how to proceed from there to find the words around it.
Upvotes: 3
Views: 3020
Reputation: 97958
import re
sub = r'(\w*)\W*(\w*)\W*(name)\W*(\w*)\W*(\w*)'
str1 = '''My name is Avi. Name identifies who you are. It is important to have a name starting with the letter A.'''
for i in re.findall(sub, str1, re.I):
    # each i is a tuple of the five captured groups; skip the empty ones
    print(" ".join([x for x in i if x != ""]))
Output
My name is Avi
Name identifies who
have a name starting with
Or, without capture groups:
sub = r'\w*\W*\w*\W*name\W*\w*\W*\w*'
for i in re.findall(sub, str1, re.I):
    # trim the spaces and full stops that \W* picked up at either end
    i = i.strip(" .")
    print(i)
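If the keyword is not hard-coded, the second pattern can be built at run time. The following is only a sketch under my own assumptions: the helper names context_pattern and find_with_context are hypothetical, and re.escape is used so that a keyword containing regex metacharacters is matched literally.

```python
import re

def context_pattern(keyword):
    # Hypothetical helper: same shape as the ungrouped pattern above,
    # with the keyword escaped so it is matched literally.
    return re.compile(r"\w*\W*\w*\W*" + re.escape(keyword) + r"\W*\w*\W*\w*",
                      re.IGNORECASE)

def find_with_context(text, keyword):
    # Strip the spaces and full stops that \W* may pick up at either end.
    return [m.strip(" .") for m in context_pattern(keyword).findall(text)]

text = ("My name is Avi. Name identifies who you are. "
        "It is important to have a name starting with the letter A.")
results = find_with_context(text, "name")
for r in results:
    print(r)
```

Note that \W* also matches a full stop followed by a space, so when the keyword itself ends a sentence (for example "a name. Then more"), the match will run on into the next sentence.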
Upvotes: 5
Reputation: 21773
I present the seriously ugly:
(([^\s.]+)\s+)?(([^\s.]+)\s+)?(name[^\w\s]*)(\s+([^\s.]+))?(\s+([^\s.]+))?
Confirmed to work on http://www.regexpal.com/
The unit (([^\s.]+)\s+)? matches one word (defined as a run of characters that are neither whitespace nor .), then one run of whitespace (the \s+), and is entirely optional because of the trailing ?.
(name[^\w\s]*) is the keyword you are searching for, followed by 0 or more characters that are neither word characters nor whitespace (so that it will match name. or name! for example).
So our strategy is to explicitly bake that we want up to two words before and after our keyword into the regex used.
Make sure this regex is used with the re.IGNORECASE flag: http://docs.python.org/2/library/re.html#re.IGNORECASE
I haven't tested to see if this regex is slow on large bodies of text or not.
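For reference, here is a minimal sketch of running that regex with Python's re module on the question's sample text. The names PATTERN, text and matches are mine, and this only checks the sample string; it says nothing about speed on large inputs.

```python
import re

# The regex from above, compiled case-insensitively.
PATTERN = re.compile(
    r"(([^\s.]+)\s+)?(([^\s.]+)\s+)?(name[^\w\s]*)(\s+([^\s.]+))?(\s+([^\s.]+))?",
    re.IGNORECASE,
)

text = ("My name is Avi. Name identifies who you are. "
        "It is important to have a name starting with the letter A.")

# group(0) is the whole match: up to two words, the keyword, up to two words.
matches = [m.group(0) for m in PATTERN.finditer(text)]
for match in matches:
    print(match)
# My name is Avi
# Name identifies who
# have a name starting with
```

Because a word is defined as excluding the full stop, the context here does not run across a sentence boundary in this sample.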
Btw, if the keyword can only be one word long, then there is a much easier solution: split your incoming string on " ", then for every instance of your keyword in the split words, also grab up to two words before and after and join them on " ". This will be much easier to read, understand, maintain and explain.
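A rough sketch of that split-based idea follows. The function name words_around, the width parameter, and the stripping of trailing punctuation before comparing are my own assumptions, not part of the description above.

```python
def words_around(text, keyword, width=2):
    # Hypothetical helper: split on single spaces, then collect up to
    # `width` words on each side of every word equal to the keyword.
    words = text.split(" ")
    results = []
    for i, word in enumerate(words):
        # Assumption: compare case-insensitively, ignoring attached punctuation.
        if word.strip(".,!?").lower() == keyword.lower():
            start = max(0, i - width)
            results.append(" ".join(words[start:i + width + 1]))
    return results

text = ("My name is Avi. Name identifies who you are. "
        "It is important to have a name starting with the letter A.")
results = words_around(text, "name")
for r in results:
    print(r)
# My name is Avi.
# is Avi. Name identifies who
# have a name starting with
```

Unlike the regex, this version knows nothing about sentences, so the second result reaches back across the full stop; stopping at sentence boundaries would need an extra check on words that end in ".".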
Upvotes: 4