Reputation: 81
I'm taking my first steps with Python and I have a problem to solve in which I need regex.
I'm parsing several lines of text and I need to grab 5 words before and after a certain match. The term to match is always the same, and lines can have more than one occurrence of that term.
r"(?i)((?:\S+\s+){0,5})<tag>(\w*)</tag>\s*((?:\S+\s+){0,5})"
This works only in very specific situations: if there's only one occurrence of the term between tags (or if the occurrences are spaced far enough apart), and if there are enough words before the first occurrence.
The problems are:
1 - If a second occurrence falls within the 5 words after the first occurrence, the second gets no preceding 5 words of its own, or it just becomes engulfed by the first match. Is this an overlapping-match problem?
2 - If there are fewer than 5 words before the term, or if I raise the range to 7 or 8, it just skips the first occurrence and jumps to the second or third.
So a line that's something like:
word word word match word word match word word word
Would not be parsed well.
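To make the overlap issue concrete, here is a minimal reproduction using the pattern from above. The sample line and the names PATTERN, line and matches are made up for illustration; the only assumption is that the term appears as <tag>match</tag>, as the pattern expects.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(
    r"(?i)((?:\S+\s+){0,5})<tag>(\w*)</tag>\s*((?:\S+\s+){0,5})"
)

# Hypothetical sample line: two tagged terms only two words apart.
line = "w1 w2 w3 <tag>match</tag> w4 w5 <tag>match</tag> w6 w7 w8"

# findall returns non-overlapping matches, one tuple of groups per match.
matches = PATTERN.findall(line)
for before, term, after in matches:
    print(repr(before), repr(term), repr(after))
```

On this sample only one match is reported, and the second tagged term ends up inside the trailing group of the first. The trailing group consumes those words, so no later match can start there.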
Is there a way to take into account these issues and make it work?
Thank you all in advance!
Upvotes: 2
Views: 353
Reputation: 1474
This might be what you're after, without using regex:
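#!/usr/bin/env python
def find_words(s, count, needle):
    """Print the words around the first occurrence of needle in s."""
    # split the string into a list of words
    lst = s.split()
    # get the index of the first occurrence of the needle
    # (list.index raises ValueError if the needle is absent)
    idx = lst.index(needle)
    # start and end of the slice; clamp start at 0 so it cannot go negative
    start = max(0, idx - count)
    end = idx + count
    # print the surrounding words as a list slice
    print(lst[start:end + 1])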
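def find_occurrences_in_list(s, count, needle):
    """Print the words around every occurrence of needle in s."""
    # split the string into a list of words
    lst = s.split()
    # indexes of every occurrence of the needle
    idx_list = [i for i, x in enumerate(lst) if x == needle]
    r = []
    for n in idx_list:
        # clamp start at 0: a negative start would count from the end of the list
        start = max(0, n - count)
        end = n + count
        # append the surrounding words, joined back into a string
        r.append(" ".join(lst[start:end + 1]))
    print(r)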
# the string of words
mystring1 = "zero one two three four five match six seven eight nine ten eleven"
# call the function: 5 words ahead and behind, looking for the word "match"
find_occurrences_in_list(mystring1, 5, "match")
# the same with 3 words ahead and behind, looking for the word "nation"
mystring2 = "Four score and seven years ago our fathers brought forth on this continent a new nation conceived in Liberty and dedicated to the proposition"
find_occurrences_in_list(mystring2, 3, "nation")
mystring3 = "zero one two three four five match six seven match eight nine ten eleven"
find_occurrences_in_list(mystring3, 2, "match")
['one two three four five match six seven eight nine ten']
['continent a new nation conceived in Liberty']
['four five match six seven', 'six seven match eight nine']
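The same word-list idea can be adapted to the tagged input from the question. The following is only a sketch under assumptions: the term is wrapped as <tag>term</tag> with no whitespace inside the tags, and the names TAGGED, strip_tags and contexts are hypothetical helpers, not part of any library. Each occurrence is handled independently, so nearby occurrences keep their own context on both sides, and the window is clamped at the start of the line.

```python
import re

# Assumption: the term is marked up as <tag>term</tag>, as in the question.
TAGGED = re.compile(r"<tag>(\w*)</tag>", re.IGNORECASE)


def strip_tags(words):
    """Remove the <tag> markup from any tagged words in the list."""
    return [TAGGED.sub(r"\1", w) for w in words]


def contexts(line, count=5):
    """Return a (before, term, after) tuple for every tagged term in line."""
    tokens = line.split()
    results = []
    for i, token in enumerate(tokens):
        m = TAGGED.search(token)
        if m is None:
            continue
        # Clamp at 0 so a short left context does not wrap around the list.
        before = tokens[max(0, i - count):i]
        # Slicing past the end is safe; it just returns fewer words.
        after = tokens[i + 1:i + 1 + count]
        results.append((strip_tags(before), m.group(1), strip_tags(after)))
    return results


if __name__ == "__main__":
    sample = "w1 w2 w3 <tag>match</tag> w4 w5 <tag>match</tag> w6 w7 w8"
    for before, term, after in contexts(sample, 5):
        print(before, term, after)
```

Because every window is sliced from the full word list, overlapping windows are not a problem here, at the cost of treating punctuation attached to a word as part of that word.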
Upvotes: 1