Reputation: 13
I am trying to remove stopwords from a text file. The text file consists of 9000+ sentences, each on its own line.
The code seems to work almost right, but I am obviously missing something: the output file has lost the line structure of the original document, which I want to preserve.
Here is the code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open(r"C:\\pytest\twitter_problems.txt", 'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt", 'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(inFile.read())
    for w in words:
        if w not in stop_words:
            outFile.write(w)
outFile.close()
Is there some kind of line tokenizer I should be using instead of word_tokenize? I checked the NLTK documentation but I can't really make sense of it (I am still a total newbie at this stuff).
Upvotes: 1
Views: 887
Reputation: 184
I suggest reading the file line by line. Something like this might work:
with open(r"C:\\pytest\twitter_problems.txt", 'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt", 'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    for line in inFile.readlines():
        words = word_tokenize(line)
        filtered_words = " ".join(w for w in words if w not in stop_words)
        outFile.write(filtered_words + '\n')
If the with statement works as intended, you do not need to close outFile afterwards.
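Note that this snippet (like your original code) assumes the required NLTK data is already installed. If word_tokenize or stopwords.words('english') raises a LookupError, the resources can be fetched once; a minimal setup sketch using NLTK's standard resource names:

import nltk

# One-time downloads of the data the code above relies on
nltk.download('stopwords')  # word lists behind stopwords.words('english')
nltk.download('punkt')      # tokenizer models behind word_tokenize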
Upvotes: 2
Reputation: 149185
If you want to preserve the line structure, simply read the file line by line and add a newline after each one:
with open(r"C:\\pytest\twitter_problems.txt", 'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt", 'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    for line in inFile:
        words = word_tokenize(line)
        for w in words:
            if w not in stop_words:
                outFile.write(w)
        outFile.write('\n')
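One caveat that neither version handles: NLTK's English stopword list is all lowercase, so capitalized tokens like "The" will slip through. A sketch of the loop that compares case-insensitively (and joins tokens with spaces so the output stays readable):

    for line in inFile:
        words = word_tokenize(line)
        # compare the lowercased token, so "The" matches the stopword "the"
        filtered = [w for w in words if w.lower() not in stop_words]
        outFile.write(" ".join(filtered) + '\n')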
Upvotes: 2