Sarchophagi

Reputation: 386

Python: Grab text before and after a keyword

keywords = ("banana", "apple", "orange", ...)
before = 50
after = 100
TEXT = "a big text string,  i.e., a page of a book"

for k in keywords:
    if k in TEXT:
        # cut = portion of TEXT starting 'before' chars before the occurrence of 'k' and ending 'after' chars after it
        # finalcut = 'cut' with its first and last words trimmed, so it never starts or ends in the middle of a word

Could you help me code the cut and finalcut string variables in the example above?

What is the most efficient solution, considering that I'm dealing with large texts, many pages, and possibly more than 20 keywords to search for?

Upvotes: 1

Views: 965

Answers (3)

Yoel

Reputation: 9614

import string
import re

# Letters to strip when a cut starts or ends in the middle of a word
alphabet = string.ascii_letters
# Escape each keyword so regex metacharacters are matched literally
joined = "|".join(re.escape(k) for k in keywords)
regex1 = re.compile("(%s)" % joined)
regex2 = re.compile("^(%s)" % joined)
regex3 = re.compile("(%s)$" % joined)

for match in regex1.finditer(TEXT):
    cut = TEXT[max(match.start() - before, 0) : match.end() + after]
    finalcut = cut
    # Keep a keyword that sits at the very start or end of the cut
    if not regex2.search(cut):
        finalcut = finalcut.lstrip(alphabet)
    if not regex3.search(cut):
        finalcut = finalcut.rstrip(alphabet)
    print(cut, finalcut)

This can be improved further: the keyword can sit at the very start of a cut only for the first match, and at the very end only for the last match, so only those two cuts need the check before stripping.

cuts = [TEXT[max(match.start() - before, 0) : match.end() + after] for match in regex1.finditer(TEXT)]
finalcuts = []
last = len(cuts) - 1
for i, cut in enumerate(cuts):
    finalcut = cut
    # Only the first cut can start with a keyword; leave it intact in that case
    if i > 0 or not regex2.search(cut):
        finalcut = finalcut.lstrip(alphabet)
    # Only the last cut can end with a keyword; leave it intact in that case
    if i < last or not regex3.search(cut):
        finalcut = finalcut.rstrip(alphabet)
    finalcuts.append(finalcut)
print(cuts, finalcuts)
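
For anyone who wants something self-contained to experiment with, here is a minimal sketch of the same single-pass idea wrapped in a generator. The function name keyword_contexts and the sample sentence are made up for illustration, and the rule for when to trim (only when the slice did not start or end at the text boundary) is my own simplification, not the exact logic above.

```python
import re
import string

def keyword_contexts(text, keywords, before=50, after=100):
    """Yield (keyword, snippet) for every keyword occurrence in text."""
    # One alternation pattern: the text is scanned once, not once per keyword.
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    letters = string.ascii_letters
    for match in pattern.finditer(text):
        start = max(match.start() - before, 0)
        end = min(match.end() + after, len(text))
        snippet = text[start:end]
        # Drop leading letters only if the slice did not begin at the text start.
        if start > 0:
            snippet = snippet.lstrip(letters)
        # Drop trailing letters only if the slice did not reach the text end.
        if end < len(text):
            snippet = snippet.rstrip(letters)
        yield match.group(0), snippet.strip()

sample = "I bought a ripe banana and a green apple at the market yesterday."
results = list(keyword_contexts(sample, ("banana", "apple"), before=8, after=8))
for keyword, snippet in results:
    print(keyword, "->", snippet)
```

Note that stripping letters also removes a complete word when the slice boundary happens to fall exactly on a word boundary, so a snippet can come out one word shorter than strictly necessary.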

Upvotes: 0

roippi

Reputation: 25964

You need to adjust your algorithm. As written it is O(n*m), where n is the number of keywords and m is the length of your text. That will NOT scale well.

Instead:

  • Make keywords a set, not a tuple. You only care about membership testing against keywords, and set membership tests are O(1).
  • You need to tokenize TEXT. This is a little more complicated than just doing split() since you need to handle removing punctuation/line breaks as well.
  • Finally, iterate over your tokens using a "sliding window" iterator, in chunks of 3. If the middle token is in your keywords set, grab the tokens around it and proceed.

That's it. So, some pseudo-ish code:

keywords = {"banana", "apple", "orange", ...}
tokens = tokenize(TEXT)

for before, target, after in window(tokens, n=3):
    if target in keywords:
        #do stuff with `before` and `after`

Here, window is whichever sliding-window implementation you prefer, like those here, and tokenize is either your own implementation built on split and strip or, if you want a library solution, nltk.tokenize.
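
To make the pseudocode concrete, here is one possible runnable sketch. Both helpers are assumptions for illustration: this tokenize is a simple regex-based stand-in (lowercasing and keeping letters, digits and apostrophes), and this window is one common deque-based sliding-window implementation, not necessarily the one linked above.

```python
import re
from collections import deque
from itertools import islice

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation and line breaks."""
    return re.findall(r"[a-z0-9']+", text.lower())

def window(iterable, n=3):
    """Yield successive overlapping tuples of length n."""
    it = iter(iterable)
    win = deque(islice(it, n), maxlen=n)
    if len(win) == n:
        yield tuple(win)
    for item in it:
        # maxlen=n makes the deque drop its oldest item automatically.
        win.append(item)
        yield tuple(win)

keywords = {"banana", "apple", "orange"}
text = "She ate a banana, then\nan apple pie."
hits = [(b, t, a) for b, t, a in window(tokenize(text), n=3) if t in keywords]
print(hits)
```

With n=3 a keyword that is the very first or very last token never lands in the middle slot, so it is skipped; padding the token list on both sides is one way around that. A wider window gives more context per hit.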

Upvotes: 3

hlt

Reputation: 6317

You can find all matches in a string using re.finditer. Each match object has a start() method that gives you the match's position in the string. You also don't need to check whether the keyword is in the string first, because if it isn't, finditer simply returns an empty iterator:

import re

keywords = ("banana", "apple", "orange", ...)
before = 50
after = 100
TEXT = "a big text string,  i.e., a page of a book"

for k in keywords:
    # re.escape keeps keywords containing regex metacharacters literal
    for match in re.finditer(re.escape(k), TEXT):
        position = match.start()
        # max() is needed because the start index must not be negative
        cut = TEXT[max(position - before, 0):position + after]
        # re.DOTALL lets "." match newlines inside the cut
        trimmed_match = re.match(r"\w*?\W+(.*)\W+\w*", cut, re.DOTALL)
        # Fall back to the untrimmed cut if the pattern finds nothing to trim
        finalcut = trimmed_match.group(1) if trimmed_match else cut

The regex trims everything up to and including the first run of non-word characters, and everything from the last run of non-word characters onward. I used re.DOTALL so that . also matches newlines in your text; re.MULTILINE would only change how ^ and $ behave.
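
To see what the trimming pattern does on its own, here is a small sketch that applies it to a few hand-made cuts. The helper name trim_edges and the sample strings are hypothetical, and the sketch compiles the pattern with re.DOTALL so that a cut spanning a line break is not truncated at the newline.

```python
import re

# Optional partial word, a run of non-word chars, the part to keep,
# another run of non-word chars, then an optional trailing partial word.
TRIM = re.compile(r"\w*?\W+(.*)\W+\w*", re.DOTALL)

def trim_edges(cut):
    """Return cut without its first and last (possibly partial) words."""
    m = TRIM.match(cut)
    # A cut with fewer than two separators cannot be trimmed; return it unchanged.
    return m.group(1) if m else cut

one_line = trim_edges("ipe banana and a gr")
two_lines = trim_edges("ipe banana\nand a gr")
single_word = trim_edges("banana")
print(repr(one_line), repr(two_lines), repr(single_word))
```

Because the pattern always removes the first and last word, a keyword sitting at the very edge of a cut would be trimmed as well, so it is worth choosing before and after large enough to leave some padding around the match.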

Upvotes: 3
