RJS

Reputation: 139

python extract sentences containing keyword(s)

I am writing a script to extract from a text file any sentence containing any one of several keywords.

The first version of the script is

    from nltk import word_tokenize

    keywords = ['coal', 'solar']

    fileinE = ["We provide detailed guidance on our equity coal capital raising plans",
               "First we are seizing the issuance of new shares under the DRIP program with immediate effect",
               "Resulting in a total of about $160 million of new share solar issued under the program in 2020"]

    fileinF = []

    for sent in fileinE:
        # lower-case every token so the keyword match is case-insensitive
        tokenized_sent = [word.lower() for word in word_tokenize(sent)]
        if any(keyw in tokenized_sent for keyw in keywords):
            fileinF.append(tokenized_sent)
    print(fileinF)

    # Output:
    [['we', 'provide', 'detailed', 'guidance', 'on', 'our', 'equity', 'coal',
    'capital', 'raising', 'plans'], ['resulting', 'in', 'a', 'total', 'of',
    'about', '$', '160', 'million', 'of', 'new', 'share', 'solar', 'issued',
    'under', 'the', 'program', 'in', '2020']]

The script performed as intended.

I then changed the script to read in the keywords from a file.

    with open('KeywordsEDF A.txt', 'r') as filein:
        keywords = filein.read()

    fileinF = []

    print(keywords)

    for sent in fileinE:
        tokenized_sent = [word.lower() for word in word_tokenize(sent)]
        if any(keyw in tokenized_sent for keyw in keywords):
            fileinF.append(tokenized_sent)
    print(fileinF)

    # Output:
    ['coal','solar']

    [['resulting', 'in', 'a', 'total', 'of', 'about', '$', '160',
    'million', 'of', 'new', 'share', 'solar', 'issued', 'under', 'the',
    'program', 'in', '2020']]

There is a problem. The output (fileinF) does not contain the sentence ['we', 'provide', 'detailed', 'guidance', 'on', 'our', 'equity', 'coal', 'capital', 'raising', 'plans'], and the only difference I can see between the two scripts is that in the first the keywords are defined within the script, while in the second they are read in from a file.

Any advice or insight on how to correct the problem would be appreciated.

Upvotes: 0

Views: 1058

Answers (2)

thorntonc

Reputation: 2126

Based on your provided code, I was able to produce a working output. Make sure to format your code correctly when you ask a question, as issues may be due to whitespace or other factors (the quotes on one list item were being broken by the apostrophe in "we’re").

    from nltk import word_tokenize

    # Reading the keywords from a file is disabled here:
    # with open ('KeywordsEDF A.txt','r') as filein:
    #     keywords=filein.read()

    keywords = ['coal', 'solar']

    fileinE = ["We provide detailed guidance on our equity coal capital raising plans",
               "First, we’re seizing the issuance of new shares under the DRIP program with immediate effect",
               "Resulting in a total of about $160 million of new share solar issued under the program in 2020"]

    # extract sentences containing keywords
    fileinF = []
    for sent in fileinE:
        tokenized_sent = [word.lower() for word in word_tokenize(sent)]
        if any(keyw in tokenized_sent for keyw in keywords):
            fileinF.append(sent)
    print(fileinF)

Assuming you want the original sentence and not a tokenized sentence, the output will be as below:

['We provide detailed guidance on our equity coal capital raising plans', 'Resulting in a total of about $160 million of new share solar issued under the program in 2020']
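
A possible explanation for the original behaviour, offered as a sketch rather than a confirmed diagnosis: `filein.read()` returns one string, not a list. If the file holds the literal text `['coal','solar']` (an assumption, based on what `print(keywords)` showed in the question), then looping over `keywords` yields single characters such as `a`, and only a sentence with a matching one-character token gets through. The example below uses `str.split()` in place of `word_tokenize` to stay dependency-free, and `ast.literal_eval` as one way to turn that text into a real list; `file_text` and `matching` are hypothetical names.

```python
import ast

# Assumption: the keyword file contains exactly this text.
file_text = "['coal','solar']"

sentences = [
    "We provide detailed guidance on our equity coal capital raising plans",
    "First we are seizing the issuance of new shares under the DRIP program with immediate effect",
    "Resulting in a total of about $160 million of new share solar issued under the program in 2020",
]


def matching(sentences, keywords):
    """Return the sentences that have at least one token equal to a keyword."""
    result = []
    for sent in sentences:
        # str.split() stands in for nltk's word_tokenize in this sketch
        tokens = sent.lower().split()
        if any(keyw in tokens for keyw in keywords):
            result.append(sent)
    return result


# Iterating over a string yields characters: '[', "'", 'c', 'o', 'a', ...
as_string = matching(sentences, file_text)

# Parsing the text first gives the intended list ['coal', 'solar']
as_list = matching(sentences, ast.literal_eval(file_text))

print(as_string)
print(as_list)
```

If the file instead lists one keyword per line, `filein.read().split()` would be the simpler way to get a list.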

Upvotes: 1

Shiva_Adasule

Reputation: 839

This might help:

    # Read the file and split it into a list of lower-cased lines (one sentence per line)
    with open('your_file_path') as f:
        sentences = f.read().lower().split('\n')

    keywords = ['coal', 'solar']

    # result holds every sentence that contains at least one keyword
    result = [sen for sen in sentences if any(key in sen for key in keywords)]
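
One caveat worth keeping in mind: `key in sen` on a string is a substring test, so `coal` would also match inside `charcoal`. If whole-word matches are wanted, a regular expression with word boundaries is one option. This is a minimal sketch with made-up sample sentences, not part of the original data:

```python
import re

keywords = ['coal', 'solar']

# Hypothetical sample sentences, chosen to show the difference
sentences = [
    "Charcoal prices rose last year",
    "New solar capacity was added in 2020",
]

# \b anchors each keyword at word boundaries; re.escape guards against
# keywords that contain regex metacharacters
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b",
    re.IGNORECASE,
)

# Substring test: 'coal' is found inside 'charcoal'
substring_hits = [s for s in sentences if any(k in s.lower() for k in keywords)]

# Whole-word test: only the sentence with the standalone word 'solar' matches
whole_word_hits = [s for s in sentences if pattern.search(s)]

print(substring_hits)
print(whole_word_hits)
```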

Upvotes: 0
