Reputation: 139
I am writing a script to extract from a text file any sentence containing any one of several keywords.
The first version of the script is
from nltk import word_tokenize

keywords = ['coal', 'solar']
fileinE = ["We provide detailed guidance on our equity coal capital raising plans",
           "First we are seizing the issuance of new shares under the DRIP program with immediate effect",
           "Resulting in a total of about $160 million of new share solar issued under the program in 2020"]
fileinF = []
for sent in fileinE:
    tokenized_sent = [word.lower() for word in word_tokenize(sent)]
    if any(keyw in tokenized_sent for keyw in keywords):
        fileinF.append(tokenized_sent)
print(fileinF)
[['we', 'provide', 'detailed', 'guidance', 'on', 'our', 'equity', 'coal',
'capital', 'raising', 'plans'], ['resulting', 'in', 'a', 'total', 'of',
'about', '$', '160', 'million', 'of', 'new', 'share', 'solar', 'issued',
'under', 'the', 'program', 'in', '2020']]
The script performed as intended.
I then changed the script to read in the keywords from a file.
with open('KeywordsEDF A.txt', 'r') as filein:
    keywords = filein.read()
fileinF = []
print(keywords)
for sent in fileinE:
    tokenized_sent = [word.lower() for word in word_tokenize(sent)]
    if any(keyw in tokenized_sent for keyw in keywords):
        fileinF.append(tokenized_sent)
print(fileinF)
['coal','solar']
[['resulting', 'in', 'a', 'total', 'of', 'about', '$', '160',
'million', 'of', 'new', 'share', 'solar', 'issued', 'under', 'the',
'program', 'in', '2020']]
There is a problem. The output (fileinF) does not contain the sentence ['we', 'provide', 'detailed', 'guidance', 'on', 'our', 'equity', 'coal', 'capital', 'raising', 'plans']. The only difference I can see between the two scripts is that in the first the keywords are defined within the script, while in the second they are read in from a file.
Any advice or insight into how to correct the problem would be appreciated.
Upvotes: 0
Views: 1058
Reputation: 2126
Based on the code you provided, I was able to produce working output. Make sure to format your code correctly when you ask a question, as issues may be due to whitespace or other factors (for example, the quotes around one list item were being broken by the apostrophe in "we’re").
from nltk import word_tokenize
'''
with open ('KeywordsEDF A.txt','r') as filein:
    keywords=filein.read()
'''
keywords = ['coal', 'solar']
fileinE = ["We provide detailed guidance on our equity coal capital raising plans",
           "First, we’re seizing the issuance of new shares under the DRIP program with immediate effect",
           "Resulting in a total of about $160 million of new share solar issued under the program in 2020"]
# extract sentences containing keywords
fileinF = []
for sent in fileinE:
    tokenized_sent = [word.lower() for word in word_tokenize(sent)]
    if any(keyw in tokenized_sent for keyw in keywords):
        fileinF.append(sent)
print(fileinF)
Assuming you want the original sentence and not a tokenized sentence, the output will be as below:
['We provide detailed guidance on our equity coal capital raising plans', 'Resulting in a total of about $160 million of new share solar issued under the program in 2020']
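One thing to check if you go back to reading the keywords from the file: filein.read() returns a single string, not a list. Iterating over a string yields single characters, so `any(keyw in tokenized_sent for keyw in keywords)` would be testing characters such as 'a' against the tokens. That would be consistent with only the third sentence matching, since 'a' is one of its tokens. The sketch below is one way to turn the file contents into a real list. It assumes the file holds either the literal text ['coal','solar'] or plain keywords separated by commas or newlines; I have not seen the actual file, and `parse_keywords` is a name I made up for illustration.

```python
import ast

def parse_keywords(raw):
    """Turn the raw text of a keywords file into a list of lower-case keywords.

    Assumption: the text is either a Python-style list literal such as
    ['coal','solar'] or plain keywords separated by commas and/or newlines.
    """
    raw = raw.strip()
    if raw.startswith('['):
        # Safely evaluate the list literal (literal_eval does not run arbitrary code)
        items = ast.literal_eval(raw)
    else:
        # Treat commas like line breaks, then split into one keyword per line
        items = raw.replace(',', '\n').splitlines()
    # Drop surrounding whitespace and empty entries, and normalise case
    return [str(item).strip().lower() for item in items if str(item).strip()]

raw = "['coal','solar']"      # what filein.read() would return under the assumption above
print(list(raw)[:4])          # iterating the raw string gives single characters
print(parse_keywords(raw))    # ['coal', 'solar']
```

In the script, this would replace the plain read: open the file as before and set keywords = parse_keywords(filein.read()).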
Upvotes: 1
Reputation: 839
This might help:
# Read the file, lower-case it, and split it into lines (one sentence per line)
with open('your_file_path') as f:
    sentences = f.read().lower().split('\n')
keywords = ['coal', 'solar']
# Keep every sentence that contains at least one keyword (substring match)
result = [sen for sen in sentences if any(key in sen for key in keywords)]
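Note that `key in sen` on a string is a substring test, so 'coal' would also match a sentence containing 'charcoal'. If you only want whole-word matches and would rather not tokenize with NLTK, a regular expression with word boundaries is one option. This is only a sketch: `contains_keyword` is a hypothetical helper and the sample sentences are made up.

```python
import re

def contains_keyword(sentence, keywords):
    """Return True if any keyword occurs in the sentence as a whole word."""
    if not keywords:
        return False
    # \b marks a word boundary; re.escape protects keywords with special characters
    pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None

keywords = ['coal', 'solar']
# Made-up sample sentences, for illustration only
sentences = ["Charcoal prices rose", "Our coal plans", "New solar shares issued"]
result = [s for s in sentences if contains_keyword(s, keywords)]
print(result)  # ['Our coal plans', 'New solar shares issued']
```

Unlike the substring version, this keeps the original casing of each sentence, because the match itself is case-insensitive.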
Upvotes: 0