RJS

Reputation: 139

Python NLTK extract sentence containing a keyword

My objective is to extract the sentences from a text file that contain any word in my list of keywords. My script cleans up the text file and uses NLTK to tokenize it into sentences and remove stopwords. That part of the script works and produces output that looks correct:

    ['affirming updated 2020 range guidance long-term earnings dividend growth outlooks provided earlier month', 'finally look forward increasing engagement existing prospective investors months come', 'turn']

The part that I wrote to extract sentences containing a keyword does not work the way I want. It extracts the keywords but not the sentences in which they occur. The output looks like this:

    [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'impact', 'zone']

    fileinC=nltk.sent_tokenize(fileinB)
    fileinD=[]
    for sent in fileinC:
        fileinD.append(' '.join(w for w in word_tokenize(sent) if w not in allinstops))
    fileinE=[sent.replace('\n', " ") for sent in fileinD]

    #extract sentences containing keywords
    fileinF=[]
    for sent in fileinE:
        fileinF.append(' '.join(w for w in word_tokenize(sent) if w in keywords))

Upvotes: 2

Views: 1166

Answers (1)

thorntonc

Reputation: 2126

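The generator expression in your last line is the likely cause: it keeps only the tokens that are keywords and joins those, so each appended string contains just the matching keywords rather than the whole sentence. It is more intuitive to break the logic into smaller steps, like so: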

fileinF = []
for sent in fileinE:
    # tokenize and lowercase tokens of the sentence
    tokenized_sent = [word.lower() for word in word_tokenize(sent)]
    # if any item in the tokenized sentence is a keyword, append the original sentence
    if any(keyw in tokenized_sent for keyw in keywords):
        fileinF.append(sent)
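For anyone who wants to try the idea in isolation, here is a minimal, self-contained sketch of the same approach. The function names `simple_tokenize` and `sentences_with_keywords` and the sample keyword list are made up for illustration, and the regex tokenizer is only a stand-in so the snippet runs without NLTK data; you can pass `word_tokenize` as the `tokenize` argument instead. It also lowercases the keywords, on the assumption that matching should be case-insensitive.

```python
import re


def simple_tokenize(sentence):
    # Stand-in tokenizer (an assumption for this sketch): runs of letters,
    # digits, apostrophes and hyphens. Swap in nltk's word_tokenize if preferred.
    return re.findall(r"[A-Za-z0-9'-]+", sentence)


def sentences_with_keywords(sentences, keywords, tokenize=simple_tokenize):
    # Lowercase the keywords once so the comparison is case-insensitive.
    keyword_set = {k.lower() for k in keywords}
    matches = []
    for sent in sentences:
        # Lowercase the tokens of this sentence and collect them in a set.
        tokens = {tok.lower() for tok in tokenize(sent)}
        # Keep the original sentence if it shares at least one token with the keywords.
        if tokens & keyword_set:
            matches.append(sent)
    return matches


# Sample data: the sentences come from the question, the keywords are hypothetical.
sample_sentences = [
    "affirming updated 2020 range guidance long-term earnings dividend growth outlooks provided earlier month",
    "finally look forward increasing engagement existing prospective investors months come",
    "turn",
]
sample_keywords = ["dividend", "Investors"]

result = sentences_with_keywords(sample_sentences, sample_keywords)
for sent in result:
    print(sent)
```

Using sets makes the membership check a single intersection per sentence, which may matter if the keyword list is long.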

Upvotes: 1
