Reputation: 3928
I use the following code to open a text file, remove the HTML, and search for words before and after a certain keyword:
import nltk
import re

text = nltk.clean_html(open('file.txt').read())
text = text.lower()
pattern = re.compile(r'''(?x) ([^\(\)0-9]\.)+ | \w+(-\w+)* | \.\.\. ''')
text = nltk.regexp_tokenize(text, pattern)
# remove the digits from text
text = [i for i in text if not i.isdigit()]
# text is now a list of words from file.txt
# I now loop over text to find all words before and after a specific keyword
keyword = ['foreign']
for i, w in enumerate(text):  # enumerate pairs each list item with its index
    if w in keyword:
        # up to five words before the keyword, clamped at the start of the list
        before_word = ' '.join(text[max(0, i-5):i])
        # up to five words after the keyword
        after_word = ' '.join(text[i+1:i+6])
        print "%s <%s> %s" % (before_word, w, after_word)
This code works well if keyword is a single word. But what if I want to find the 5 words before and after 'foreign currency'? The issue is that every space-separated word is a separate item in the text list, so I can't just do keyword = ['foreign currency']. How can I solve this issue?
Sample .txt file here.
Upvotes: 2
Views: 2063
Reputation: 28983
Have you considered a regex?
This will match and capture the five words before, and the five words after, 'foreign currency':
((\w+ ){5})foreign currency(( \w+){5})
Edit: this regex breaks on things like tabs, quotes, commas, parentheses, etc. And the provided 'sample of words to be found' doesn't have 5 following words, so it wouldn't match that.
Here's an updated regex that matches the 5 words leading up to, and the 1-5 words following, the phrase. It treats words as runs of 'non-space' characters separated by 'non-word' characters, and it captures the whole match, including the search text, as one group:
((\S+\W){5}foreign currency(\W\S+){1,5})
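As a rough sketch of how that updated regex could be used from Python, assuming it is run on the lowercased text as one string (before tokenizing): the sample sentence and the helper name context_matches are made up for illustration, and the groups are rearranged so the words before and after the phrase are captured separately.

```python
import re

# Hypothetical sample text, not taken from the asker's file.
sample = ("The bank reported that losses on its foreign currency "
          "positions were smaller than expected this quarter.")

# Exactly five "words" (runs of non-space characters) before the phrase
# and one to five after it. Unlike the regex above, the context on each
# side is captured in its own group.
pattern = re.compile(r'((?:\S+\W){5})foreign currency((?:\W\S+){1,5})')

def context_matches(text):
    """Return (before, after) string pairs for every match in text."""
    return [(m.group(1).strip(), m.group(2).strip())
            for m in pattern.finditer(text)]

for before, after in context_matches(sample):
    print("%s <foreign currency> %s" % (before, after))
```

Note that this still requires five full words before the phrase, so an occurrence near the very start of the text would not match.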
Otherwise, you could try:
something = text.find('foreign currency')
to find the first position of that text, and
something = text.find('foreign currency', previous_end_pos)
to tell it to look starting after the end of the previous step, to find the next instance.
Upvotes: 3
Reputation: 131
Have you thought about using a variable for the number of words in the "keyword" and iterating through the text by that number of items at a time?
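One possible reading of this idea, as a minimal sketch: split the keyword into a list of words, then compare a window of that many tokens against it at each position. The function name find_phrase_contexts, the width parameter, and the sample tokens are all hypothetical. The window advances one token at a time, since stepping by the full phrase length could miss an occurrence that straddles two steps.

```python
def find_phrase_contexts(tokens, phrase, width=5):
    """Yield (before, match, after) for each occurrence of phrase in tokens.

    tokens and phrase are lists of words; width is the number of
    context words to keep on each side.
    """
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        # Compare a window of n tokens against the phrase.
        if tokens[i:i + n] == phrase:
            before = tokens[max(0, i - width):i]
            after = tokens[i + n:i + n + width]
            yield before, tokens[i:i + n], after

# Hypothetical token list standing in for the tokenized file.
tokens = ("the bank reported that losses on its foreign currency "
          "positions were smaller than expected this quarter").split()
keyword = 'foreign currency'.split()

for before, match, after in find_phrase_contexts(tokens, keyword):
    print("%s <%s> %s" % (' '.join(before), ' '.join(match), ' '.join(after)))
```

The same function handles a one-word keyword such as ['foreign'], and occurrences near either end of the list simply get fewer context words.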
Upvotes: 0