Reputation: 386
keywords = ("banana", "apple", "orange", ...)
before = 50
after = 100
TEXT = "a big text string, i.e., a page of a book"
for k in keywords:
    if k in TEXT:
        # cut = portion of text starting 'before' chars before the occurrence of 'k' and ending 'after' chars after the occurrence of 'k'
        # finalcut = 'cut' with first and last WORDS trimmed to ensure the boundary words are not cut in the middle
Could you help me write the cut and finalcut string variables in the above example?
What is the most efficient solution, considering I'm dealing with big texts, numerous pages, and possibly more than 20 keywords to search for?
Upvotes: 1
Views: 965
Reputation: 9614
import string
import re

# all ASCII letters (Python 2; on Python 3 use string.ascii_letters)
alphabet = string.lowercase + string.uppercase

# regex1 finds any keyword; regex2/regex3 test whether a cut starts/ends with one
regex1 = re.compile("(%s)" % "|".join(keywords))
regex2 = re.compile("^(%s)" % "|".join(keywords))
regex3 = re.compile("(%s)$" % "|".join(keywords))

for match in regex1.finditer(TEXT):
    # take 'before' chars before and 'after' chars after the match (clamped at 0)
    cut = TEXT[max(match.start() - before, 0) : match.end() + after]
    finalcut = cut
    if not regex2.search(cut):
        # the cut does not start with a keyword: drop the possibly truncated first word
        finalcut = finalcut.lstrip(alphabet)
    if not regex3.search(cut):
        # the cut does not end with a keyword: drop the possibly truncated last word
        finalcut = finalcut.rstrip(alphabet)
    print cut, finalcut
This can be further improved: only the first and the last cut can actually start or end with a keyword (when the keyword sits at the very beginning or very end of TEXT), and in that case it shouldn't be stripped, so the checks are only needed for those two cuts.
cuts = [TEXT[max(match.start() - before, 0) : match.end() + after]
        for match in regex1.finditer(TEXT)]
finalcuts = [0] * len(cuts)

for i, cut in enumerate(cuts):
    # only the first cut can start with a keyword and only the last can end with one,
    # so the regex2/regex3 checks are restricted to those positions
    if i == 0 and not regex2.search(cut):
        finalcuts[0] = cuts[0].lstrip(alphabet)
    elif i == 0:
        finalcuts[0] = cuts[0]
    if i == len(cuts) - 1 and not regex3.search(cut):
        if i == 0:
            finalcuts[i] = finalcuts[i].rstrip(alphabet)
        elif i > 0:
            finalcuts[i] = cuts[i].rstrip(alphabet)
    elif i > 0:
        finalcuts[i] = cuts[i].strip(alphabet)

print cuts, finalcuts
Upvotes: 0
Reputation: 25964
You need to adjust your algorithm. As written it is O(n*m), n being # of keywords and m being the length of your text. That will NOT scale well.
Instead:
- Make keywords a set, not a tuple. You only care about membership testing against keywords, and set membership tests are O(1).
- Tokenize TEXT. This is a little more complicated than just doing split(), since you need to handle removing punctuation/line breaks as well.
- Walk the tokens; when you find one that is in the keywords set, grab the tokens around it and proceed. That's it. So, some pseudo-ish code:
keywords = {"banana", "apple", "orange", ...}
tokens = tokenize(TEXT)

for before, target, after in window(tokens, n=3):
    if target in keywords:
        # do stuff with `before` and `after`
Where window is your choice of sliding window implementations like those here, and tokenize is either your own implementation involving split and strip, or perhaps nltk.tokenize if you want a library solution.
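For illustration, here is a minimal sketch of what those two helpers could look like, assuming a basic itertools-based sliding window and a naive whitespace/punctuation tokenizer are good enough (window and tokenize are just the placeholder names used in the pseudo-ish code above):

import string
from itertools import islice

def window(seq, n=3):
    # yield successive overlapping tuples of length n from seq
    # (a standard sliding-window recipe; any equivalent implementation works)
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

def tokenize(text):
    # naive tokenizer: split on whitespace and strip surrounding punctuation
    return [t.strip(string.punctuation) for t in text.split()]

With helpers like these the loop touches each token once, so a pass over a page stays linear in its length no matter how many keywords are in the set.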
Upvotes: 3
Reputation: 6317
You can find all matches in a string using re.finditer. Each of the match objects has a start() method you can use to figure out the position in the string. You also don't need to check whether the key is in the string first, because if it isn't, finditer simply returns an empty iterator:
import re

keywords = ("banana", "apple", "orange", ...)
before = 50
after = 100
TEXT = "a big text string, i.e., a page of a book"

for k in keywords:
    for match in re.finditer(k, TEXT):
        position = match.start()
        # max is needed because that index must not be negative
        cut = TEXT[max(position - before, 0):position + after]
        # trim the possibly truncated first and last words of the cut
        trimmed_match = re.match(r"\w*?\W+(.*)\W+\w*", cut, re.MULTILINE)
        finalcut = trimmed_match.group(1)
The regex trims everything up to and including the first sequence of non-word characters, and everything from and including the last sequence of non-word characters (I added re.MULTILINE in case there are newlines in your text).
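As a quick illustration of how the trimming works (the cut below is made up, not taken from the question's text):

cut = "ana split next to a big apple pie on the tabl"   # first/last words sliced mid-word
trimmed_match = re.match(r"\w*?\W+(.*)\W+\w*", cut)
print(trimmed_match.group(1))  # -> "split next to a big apple pie on the"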
Upvotes: 3