user2963623

Reputation: 2295

Python regex to extract a portion of string

I want to extract a portion of a large string. There is a target word and an upper bound on the number of words before and after it. The extracted substring must therefore contain the target word along with up to that many words before and after it. The before and after parts can contain fewer words if the target word is close to the beginning or end of the text.

Example string

"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

Target word: laboris

words_before: 5

words_after: 2

Should return ['veniam, quis nostrud exercitation ullamco laboris nisi ut']

I thought of a couple of possible patterns, but none of them worked. I guess it can also be done by simply traversing the string forwards and backwards from the target word. However, a regex would definitely make things easier. Any help would be appreciated.

Upvotes: 8

Views: 309

Answers (3)

alecxe

Reputation: 474201

You can also approach it with nltk and its concordance support, inspired by the question "Calling NLTK's concordance - how to get text before/after a word that was used?":

A concordance view shows us every occurrence of a given word, together with some context.

import nltk


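def get_neighbors(input_text, word, before, after):
    text = nltk.Text(nltk.tokenize.word_tokenize(input_text))

    concordance_index = nltk.ConcordanceIndex(text.tokens)
    # Token position of the first occurrence of the word
    offset = next(offset for offset in concordance_index.offsets(word))

    # Punctuation is a token too; the extra -1 only compensates for the one comma in this example.
    # max() keeps a negative start from wrapping around to the end of the list.
    start = max(0, offset - before - 1)
    return text.tokens[start: offset + after + 1]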

text = u"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."  
print(get_neighbors(text, 'laboris', 5, 2))

Prints the 5 words before the target word and the 2 after it; the comma counts as a separate token:

[u'veniam', u',', u'quis', u'nostrud', u'exercitation', u'ullamco', u'laboris', u'nisi', u'ut']
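Since the tokenizer returns punctuation as separate tokens (note the u',' in the output above), a fixed token offset does not always correspond to a fixed number of words. Below is a minimal sketch of one way to count only word tokens. It assumes you already have a token list; word_window and the short tokens list are hypothetical names and data for illustration, not part of nltk.

```python
def word_window(tokens, target, before, after):
    """Return the tokens around the first `target`, counting only word tokens.

    `tokens` is assumed to be an already tokenized list in which punctuation
    appears as separate items.
    """
    idx = tokens.index(target)  # raises ValueError if the target is absent

    # Walk left until `before` word tokens are collected or the start is reached.
    start, seen = idx, 0
    while start > 0 and seen < before:
        start -= 1
        if tokens[start].isalnum():
            seen += 1

    # Walk right in the same way for the `after` word tokens.
    end, seen = idx, 0
    while end < len(tokens) - 1 and seen < after:
        end += 1
        if tokens[end].isalnum():
            seen += 1

    return tokens[start:end + 1]


# Hand-written tokens standing in for the tokenizer output
tokens = ['minim', 'veniam', ',', 'quis', 'nostrud', 'exercitation',
          'ullamco', 'laboris', 'nisi', 'ut', 'aliquip']
print(word_window(tokens, 'laboris', 5, 2))
```

Punctuation tokens that fall inside the window are kept in the result; they are just not counted towards the limits.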

Upvotes: 2

LetzerWille

Reputation: 5668

If you still want a regex:

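import re

def find_context(word_, n_before, n_after, string_):
    # Exactly n_before words (each followed by separators) before the target
    b = r'\w+\W+' * n_before
    # Exactly n_after words (each preceded by separators) after the target
    a = r'\W+\w+' * n_after
    # re.escape keeps any regex metacharacters in the target word literal
    pattern = '(' + b + re.escape(word_) + a + ')'

    print(re.search(pattern, string_).group(1))


# st holds the example string from the question
find_context('laboris', 5, 2, st)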

veniam, quis nostrud exercitation ullamco laboris nisi ut

find_context('culpa', 2, 2, st)

sunt in culpa qui officia
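Note that the pattern above asks for exactly n_before and n_after words, so re.search finds nothing when the target is too close to the start or end of the text, which the question explicitly wants to allow. One possible variation, as a rough sketch, is to use bounded quantifiers instead. The name find_context_bounded and the shortened sample string are made up for this example.

```python
import re


def find_context_bounded(word, n_before, n_after, text):
    """Return the target word with up to n_before/n_after surrounding words."""
    # {0,n} accepts fewer words when the target is near either end of the text.
    before = r'(?:\w+\W+){0,%d}' % n_before
    after = r'(?:\W+\w+){0,%d}' % n_after
    # re.escape guards against regex metacharacters in the target word;
    # \b keeps 'labor' from matching inside 'laboris'.
    pattern = before + r'\b' + re.escape(word) + r'\b' + after
    match = re.search(pattern, text)
    return match.group(0) if match else None


sample = ("Ut enim ad minim veniam, quis nostrud exercitation ullamco "
          "laboris nisi ut aliquip ex ea commodo consequat.")
print(find_context_bounded('laboris', 5, 2, sample))
print(find_context_bounded('enim', 5, 2, sample))
```

Because re.search returns the leftmost match and the quantifiers are greedy, this should still pick up the full number of words whenever they are available. It returns None instead of raising when the word is missing.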

Upvotes: 3

Remi Guan

Reputation: 22312

If you just want to split the text into words, you can use the split() method together with slice(). For example:

>>> text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.".split()

>>> n = text.index('laboris')
>>> s = slice(max(n - 5, 0), n + 3)

>>> text[s]
['veniam,', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut']
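One thing to watch with this approach: split() leaves punctuation attached to the words, so index() only finds the target if it appears without any (searching for 'veniam' would fail because the list holds 'veniam,'). Here is a small sketch that wraps the same idea in a function and compares against punctuation-stripped words. split_context and the shortened sample string are hypothetical, just for illustration.

```python
import string


def split_context(text, target, before, after):
    """Return up to `before` words, the target, and up to `after` words."""
    words = text.split()
    # Compare with surrounding punctuation stripped so 'veniam,' matches 'veniam'.
    stripped = [w.strip(string.punctuation) for w in words]
    n = stripped.index(target)  # raises ValueError if the target is absent
    # max() stops a negative start from wrapping around to the end of the list.
    return words[max(n - before, 0):n + after + 1]


sample = ("Ut enim ad minim veniam, quis nostrud exercitation ullamco "
          "laboris nisi ut aliquip ex ea commodo consequat.")
print(' '.join(split_context(sample, 'laboris', 5, 2)))
print(' '.join(split_context(sample, 'enim', 5, 2)))
```

The returned words keep their original punctuation; only the lookup uses the stripped copies.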

Upvotes: 5
