Reputation: 117
I'm trying to remove stopwords from a tab-delimited .txt file using the following code:
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

file = open('textposts_01.txt', encoding='UTF-8')
stop_words = set(stopwords.words('english'))
line = file.read()
words = line.split()
for r in words:
    if not r in stop_words:
        appendFile = open('textposts_02.txt', mode='a', encoding='UTF-8')
        appendFile.write(" "+r)
        appendFile.close()
The code executes successfully, but when I view the results, all of the rows have been rewritten onto a single line. How can I maintain the rows and columns while removing the stopwords?
I found the following solution on a similar post:
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

file = open('textposts_01.txt', encoding='UTF-8')
stop_words = set(stopwords.words('english'))
line = file.read()
words = line.split()
for r in words:
    if not r in stop_words:
        appendFile = open('textposts_02.txt', mode='a', encoding='UTF-8')
        appendFile.write(" "+r)
        appendFile.write("\n")
        appendFile.close()
But inserting a new line simply created a new line after every word so that if I started with a row like this:
0 make a list of every person you know
the results looked like this:
0
make
list
every
person
know
and I need the results in rows like so:
0 make list every person know
I've been searching a while, but haven't found any solutions.
Upvotes: 0
Views: 54
Reputation: 2311
You can loop over the file and add a newline once you're done with each line.
Also, reading the whole file at once is not memory-friendly for large files. The following approach is better and safer: it reads one line at a time and uses with blocks so both files are closed automatically:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
with open('textposts_01.txt', encoding='UTF-8') as f:
    with open('textposts_02.txt', mode='a', encoding='UTF-8') as append_file:
        for line in f:
            # Write the words that survive the filter, separated by spaces
            for r in line.split():
                if r not in stop_words:
                    append_file.write(" "+r)
            # One newline per input line, not per word
            append_file.write("\n")
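Note that line.split() splits on tabs as well as spaces, so the loop above turns the tab-delimited columns into space-separated words. If the tabs need to survive, one possible variation is to split each line on tabs first and filter the words inside each column. The following is only a sketch: the function name remove_stopwords_keep_columns and the tiny demo_stop_words set are made up for illustration (in the real script you would pass set(stopwords.words('english')) instead), and the lowercase comparison is an assumption that is not in the code above.

```python
def remove_stopwords_keep_columns(line, stop_words):
    """Drop stopwords inside each tab-separated column and keep the tabs."""
    # Strip only the trailing newline so empty trailing columns are preserved
    columns = line.rstrip("\n").split("\t")
    cleaned = [
        # Assumption: compare in lowercase so "The" is treated like "the"
        " ".join(w for w in col.split() if w.lower() not in stop_words)
        for col in columns
    ]
    return "\t".join(cleaned)


# Hypothetical stand-in for set(stopwords.words('english')),
# so the sketch runs without the NLTK data installed.
demo_stop_words = {"a", "of", "you"}

row = "0\tmake a list of every person you know\n"
print(remove_stopwords_keep_columns(row, demo_stop_words))
```

In the file loop you would then write remove_stopwords_keep_columns(line, stop_words) + "\n" once per input line.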
Upvotes: 1
Reputation: 189936
appendFile.write(" "+r)
will simply write the text without a trailing newline, so everything ends up on one line. To end each write with a newline, you probably want
appendFile.write(r + "\n")
instead.
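Keep in mind that in the question's loop r is a single word, not a row, because read().split() breaks the text on every whitespace character, newlines included. A newline per write therefore gives one word per line. To get one output line per input row, the words have to be grouped by row before the newline is added. A minimal sketch of that idea, working on a string rather than a file, with filter_text as a made-up helper name and a one-word stopword set standing in for the NLTK list:

```python
def filter_text(text, stop_words):
    """Remove stopwords row by row, emitting one output line per input line."""
    out_lines = []
    for line in text.splitlines():
        # split() with no arguments splits on spaces and tabs within this row
        kept = [w for w in line.split() if w not in stop_words]
        out_lines.append(" ".join(kept))
    # Exactly one newline after each row
    return "\n".join(out_lines) + "\n"


sample = "0\tmake a list\n1\tcall a friend\n"

# Splitting the whole text at once loses the row boundaries
print(sample.split())

# Splitting line by line keeps them
print(filter_text(sample, {"a"}))
```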
Upvotes: 2