Reputation: 2295
I want to extract a portion of a large string. There's a target word and an upper bound on the number of words before and after that. The extracted substring must therefore contain the target word along with the upper bound words before and after it. The before and after part can contain lesser words if the target word is closer to the beginning or end of the text.
Example string:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
Target word: laboris
words_before: 5
words_after: 2
Should return ['veniam, quis nostrud exercitation ullamco laboris nisi ut']
I thought of a couple of possible patterns, but none of them worked. I guess it could also be done by simply traversing the string forwards and backwards from the target word, but a regex would definitely make things easier. Any help would be appreciated.
Upvotes: 8
Views: 309
Reputation: 474201
You can also approach it with nltk
and its "concordance" method, inspired by Calling NLTK's concordance - how to get text before/after a word that was used?:
A concordance view shows us every occurrence of a given word, together with some context.
import nltk

def get_neighbors(input_text, word, before, after):
    text = nltk.Text(nltk.tokenize.word_tokenize(input_text))
    concordance_index = nltk.ConcordanceIndex(text.tokens)
    # offset of the first occurrence of the word
    offset = next(offset for offset in concordance_index.offsets(word))
    # note: punctuation is tokenized separately, and the extra -1 widens
    # the left-hand window by one token to absorb the comma token
    return text.tokens[offset - before - 1: offset] + text.tokens[offset: offset + after + 1]
text = u"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
print(get_neighbors(text, 'laboris', 5, 2))
Prints the tokens before and after the target word (punctuation such as the comma counts as its own token):
[u'veniam', u',', u'quis', u'nostrud', u'exercitation', u'ullamco', u'laboris', u'nisi', u'ut']
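If the word occurs more than once, `offsets` returns every match position, so the same slicing can be applied per occurrence. A sketch of that variant (the function name, the plain `split()` tokenization, and the `max(0, ...)` guard are my additions; swap in `nltk.word_tokenize` if you want punctuation-aware tokens):

```python
import nltk

def get_all_neighbors(input_text, word, before, after):
    # plain split() tokenization keeps this dependency-light;
    # nltk.word_tokenize would split punctuation into separate tokens
    tokens = input_text.split()
    index = nltk.ConcordanceIndex(tokens)
    contexts = []
    for offset in index.offsets(word):
        # clamp the lower bound so a match near the start of the
        # text doesn't produce a negative (wrapping) slice index
        start = max(0, offset - before)
        contexts.append(tokens[start:offset + after + 1])
    return contexts
```

This returns a list of context windows, one per occurrence of the word.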
Upvotes: 2
Reputation: 5668
If you still want regex....
import re

def find_context(word_, n_before, n_after, string_):
    # raw strings avoid invalid-escape warnings in newer Pythons
    b = r'\w+\W+' * n_before
    a = r'\W+\w+' * n_after
    pattern = '(' + b + word_ + a + ')'
    print(re.search(pattern, string_).group(1))

find_context('laboris', 5, 2, st)  # st is the example string from the question
veniam, quis nostrud exercitation ullamco laboris nisi ut
find_context('culpa', 2, 2, st)
sunt in culpa qui officia
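This pattern fails when the target sits closer to the start or end of the text than the requested window (the "can contain fewer words" case in the question), because each `\w+\W+` repetition is mandatory. Bounded quantifiers make the surrounding words optional; a sketch, where `find_context_bounded` is my own naming:

```python
import re

def find_context_bounded(word, n_before, n_after, text):
    # {0,n} lets the match succeed with fewer surrounding words when
    # the target is near the start or end of the text
    pattern = (r'((?:\w+\W+){0,%d}' % n_before
               + re.escape(word)
               + r'(?:\W+\w+){0,%d})' % n_after)
    match = re.search(pattern, text)
    return match.group(1) if match else None
```

`re.escape` guards against regex metacharacters in the target word; the greedy `{0,n}` still grabs as many surrounding words as are available.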
Upvotes: 3
Reputation: 22312
If you want to split the text into words, you can use the slice() and split() functions. For example:
>>> text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.".split()
>>> n = text.index('laboris')
>>> s = slice(n - 5, n + 3)
>>> text[s]
['veniam,', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut']
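One caveat: if the target sits within five words of the start of the list, `n - 5` goes negative and the slice silently wraps around to the end of the list. Clamping the start with `max()` avoids that; a small adjustment to the approach above (the `window` helper is my naming, not part of the original answer):

```python
def window(words, target, before, after):
    # words: a pre-split list of tokens; target must match a token exactly
    n = words.index(target)
    # max(0, ...) prevents a negative start index, which would otherwise
    # make the slice wrap around to the end of the list
    return words[max(0, n - before): n + after + 1]

text = "Lorem ipsum dolor sit amet".split()
window(text, 'ipsum', 5, 2)  # returns ['Lorem', 'ipsum', 'dolor', 'sit']
```

Note that `after + 1` is needed because the slice end is exclusive and the window includes the target itself.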
Upvotes: 5