Plug4

Reputation: 3928

Python: Search for words before and after a pair of keywords

I use the following code to open a text file, remove the HTML, and search for words before and after a certain keyword:

import nltk
import re

text = nltk.clean_html(open('file.txt').read())
text = text.lower()

pattern = re.compile(r'''(?x) ([^\(\)0-9]\.)+ | \w+(-\w+)* |  \.\.\. ''')
text = nltk.regexp_tokenize(text, pattern)

#remove the digits from text
text = [i for i in text if not i.isdigit()]

# Text is now a list of words from file.txt
# I now loop over the Text to find all words before and after a specific keyword

keyword = ['foreign']
for i, w in enumerate(text):  # enumerate yields (index, word) pairs
    if w in keyword:
        # up to five words on each side of the keyword
        before_word = ' '.join(text[max(0, i-5):i])
        after_word = ' '.join(text[i+1:i+6])
        print("%s <%s> %s" % (before_word, w, after_word))

This code works well if the keyword is a single word. But what if I want to find the 5 words before and after 'foreign currency'? The problem is that every space-separated word is its own item in the text list, so keyword = ['foreign currency'] never matches anything. How can I solve this?

Sample .txt file here.

Upvotes: 2

Views: 2063

Answers (2)

TessellatingHeckler

Reputation: 28983

Have you considered a regex?

This will match and capture five words before, and five words after, foreign currency:

((\w+ ){5})foreign currency(( \w+){5})

Edit: this regex breaks on things like tabs, quotes, commas, parentheses, etc. And the provided 'sample of words to be found' doesn't have 5 following words, so it wouldn't match that.

Here's an updated regex that matches five words before, and one to five words after, the phrase. It treats a word as a run of non-space characters followed by a non-word character, and it captures the whole match, search text included, as one group:

((\S+\W){5}foreign currency(\W\S+){1,5})
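
To see how the updated pattern behaves, here is a minimal sketch that runs it with Python's re module. The helper name phrase_context and the sample sentence are made up for illustration; they are not from the question's file. Note that the pattern still requires exactly five words before the phrase, so an occurrence near the start of the text is skipped.

```python
import re

def phrase_context(text, phrase, before=5, after=5):
    """Return each match of `before` words + phrase + 1..`after` words."""
    # Same shape as the regex above, with the phrase escaped and the
    # inner groups made non-capturing so group(1) is the whole match.
    pattern = re.compile(
        r'((?:\S+\W){%d}%s(?:\W\S+){1,%d})' % (before, re.escape(phrase), after)
    )
    return [m.group(1) for m in pattern.finditer(text)]

# Hypothetical sample text, already lowercased as in the question.
sample = ("the company reported large losses on its foreign currency "
          "contracts during the third quarter of the year")

for hit in phrase_context(sample, 'foreign currency'):
    print(hit)
# prints: reported large losses on its foreign currency contracts during the third quarter
```

If occurrences with fewer than five preceding words should also be reported, one option is to change the leading quantifier to a range such as {0,5}, at the cost of more backtracking.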

Otherwise, you could try:

  1. Join the text all into one line, no newlines
  2. Use something = text.find('foreign currency') to find the first position of that text
  3. Count backwards from there, character by character looking for spaces, for 5 words
  4. Count forwards from the end of the match, character by character looking for spaces, for 5 words
  5. Loop all of this, using something = text.find('foreign currency', previous_end_pos) to tell it to look starting after the end of the previous step, to find the next instance.
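
The steps above can be sketched roughly as follows. This is one possible reading, not a definitive implementation: instead of counting spaces character by character, it splits the text on either side of the match and keeps the nearest words, which gives the same words with less bookkeeping. The function name find_phrase_contexts and the sample text are assumptions made for the example.

```python
def find_phrase_contexts(text, phrase, n=5):
    """Return (before, phrase, after) tuples for every occurrence of phrase."""
    # Step 1: collapse newlines and runs of whitespace into single spaces.
    text = ' '.join(text.split())
    results = []
    # Step 2: locate the first occurrence (-1 means not found).
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        # Steps 3 and 4: keep up to n words on each side of the match.
        before = text[:start].split()[-n:]
        after = text[end:].split()[:n]
        results.append((' '.join(before), phrase, ' '.join(after)))
        # Step 5: continue searching after the end of this match.
        start = text.find(phrase, end)
    return results

# Hypothetical sample text with two occurrences across a line break.
sample = ("we hedge foreign currency risk.\n"
          "Our foreign currency exposure grew in the last fiscal year")

for before, phrase, after in find_phrase_contexts(sample, 'foreign currency'):
    print("%s <%s> %s" % (before, phrase, after))
```

Because str.find works on raw characters, it also matches the phrase inside longer words, and punctuation stays attached to the neighbouring words; whether that matters depends on the input.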

Upvotes: 3

brian

Reputation: 131

Have you thought about using a variable for the number of words in the "keyword" and iterating through the text by that number of items at a time?
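
A minimal sketch of that idea, assuming the text has already been tokenized into a list of words as in the question: split the keyword phrase into its own word list, then slide a window of that length over the tokens. The function name keyword_in_context and the sample tokens are made up for illustration.

```python
def keyword_in_context(tokens, phrase, n=5):
    """Find a multi-word phrase in a token list and return context strings."""
    key = phrase.split()   # e.g. ['foreign', 'currency']
    k = len(key)           # number of words in the keyword
    hits = []
    # Compare every window of k consecutive tokens against the keyword.
    for i in range(len(tokens) - k + 1):
        if tokens[i:i + k] == key:
            before = tokens[max(0, i - n):i]   # up to n words before
            after = tokens[i + k:i + k + n]    # up to n words after
            hits.append("%s <%s> %s" % (' '.join(before), phrase, ' '.join(after)))
    return hits

# Hypothetical token list standing in for the tokenized file.
tokens = ("the bank reported a gain on foreign currency "
          "swaps in the second quarter").split()

for line in keyword_in_context(tokens, 'foreign currency'):
    print(line)
# prints: bank reported a gain on <foreign currency> swaps in the second quarter
```

The same function handles single-word keywords, since a one-word phrase just produces a window of length 1.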

Upvotes: 0
