cwinhall

Reputation: 13

Python: Stopwords to txt file output is not per line

I am trying to remove stopwords from a text file. The text file consists of 9,000+ sentences, each on its own line.

The code seems to be working almost right, but I am clearly missing something: the output file has lost the line structure of the original document, which I want to keep.

Here is the code:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open(r"C:\\pytest\twitter_problems.txt",'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt",'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(inFile.read())
    for w in words:
        if w not in stop_words:
            outFile.write(w)
outFile.close()

Is there some kind of line tokenizer I should be using instead of word_tokenize? I checked the NLTK documentation but I can't really make sense of it (I am still a total newbie at this stuff).

Upvotes: 1

Views: 887

Answers (2)

smernst

Reputation: 184

I suggest reading the file line by line. Something like this might work:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open(r"C:\\pytest\twitter_problems.txt",'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt",'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    for line in inFile:
        words = word_tokenize(line)
        filtered_words = " ".join(w for w in words if w not in stop_words)
        outFile.write(filtered_words + '\n')
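The per-line join-and-write idea can be checked with a toy stopword set, no NLTK data needed; here str.split() stands in for word_tokenize:

```python
# Minimal sketch of the same per-line filtering, assuming a toy
# stopword set and str.split() instead of word_tokenize, so it runs
# without NLTK's downloaded data.
stop_words = {"the", "is", "a", "of"}

def filter_line(line):
    # Drop stopwords, rejoin the surviving words with single spaces.
    return " ".join(w for w in line.split() if w not in stop_words)

lines = ["the cat is on a mat", "a dog chased the ball"]
filtered = [filter_line(line) for line in lines]
# One output string per input line, so the line structure survives.
```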

The with statement closes both files automatically when the block ends, so there is no need to call outFile.close() afterwards.
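A quick sketch (writing a throwaway temporary file) shows the file object is already closed the moment the with-block exits:

```python
import os
import tempfile

# Write through a with-block, then inspect the handle's state.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf8") as outFile:
    outFile.write("hello\n")

# Leaving the block closed the file; an explicit close() is redundant.
closed_after_with = outFile.closed
```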

Upvotes: 2

Serge Ballesta

Reputation: 149185

If you want to preserve the line structure, simply read the file line by line and add a newline after each one:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open(r"C:\\pytest\twitter_problems.txt",'r', encoding="utf8") as inFile, open(r"C:\\pytest\twitter_problems_filtered.txt",'w', encoding="utf8") as outFile:
    stop_words = set(stopwords.words('english'))
    for line in inFile:
        words = word_tokenize(line)
        for w in words:
            if w not in stop_words:
                outFile.write(w)
        outFile.write('\n')
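The loop above can be exercised in memory, with io.StringIO standing in for the two files and str.split for word_tokenize (plus a space after each kept token, a small assumed addition so words on a line don't run together):

```python
import io

# Toy stopword set; the real code uses stopwords.words('english').
stop_words = {"is", "the", "a"}
inFile = io.StringIO("this is the first line\nand the second one\n")
outFile = io.StringIO()

for line in inFile:
    for w in line.split():          # word_tokenize(line) in the real code
        if w not in stop_words:
            outFile.write(w + ' ')  # trailing space keeps words apart
    outFile.write('\n')             # one newline per input line

result = outFile.getvalue()
```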

Upvotes: 2
