TAN-C-F-OK

Reputation: 179

Remove stop words (NLTK) from multiple files

I have a couple thousand text files in a local folder. I want to remove the stop words from each file in this folder and save the new files in a subfolder.

Code for one file:

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
file1 = open("1_1.txt")
line = file1.read()
words = line.split()
for r in words:
    if not r in stop_words:
        appendFile = open('subfolder/1_1.txt','a')
        appendFile.write(" "+r)
        appendFile.close()

I think I have to try it with glob, but I don't seem to understand the documentation. Maybe I should also lower() the text? There has to be a super easy way, but I only find tutorials for a single sentence or a single file, never for multiple files.

Upvotes: 1

Views: 1319

Answers (1)

Sleeba Paul

Reputation: 633

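from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Needs the NLTK 'stopwords' corpus and tokenizer data (see nltk.download())
stop_words = set(stopwords.words('english'))

# Read the whole input file
with open("file1.txt") as infile:
    text = infile.read()

# Tokenize, then keep only the words that are not stop words.
# The NLTK English list is lower case, so compare in lower case.
words = word_tokenize(text)
words_without_stop_words = [word for word in words if word.lower() not in stop_words]
new_words = " ".join(words_without_stop_words)

# Write the result; the 'subfolder' directory must already exist
with open('subfolder/file1.txt', 'w') as outfile:
    outfile.write(new_words)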

Now add a loop over the file names in your local folder and you're good to go.
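One way that loop could look, as a rough sketch rather than a definitive solution: it uses pathlib's glob to find every .txt file in a folder and writes a cleaned copy under the same name into the subfolder. The function name clean_folder and its tokenize parameter are made up for this example, and it assumes the files are UTF-8 text with a .txt extension. By default it splits on whitespace like the code in the question; pass word_tokenize instead if you want NLTK tokenization.

```python
from pathlib import Path


def clean_folder(src_dir, dst_dir, stop_words, tokenize=str.split):
    """Write a stop-word-free copy of every .txt file in src_dir to dst_dir.

    Hypothetical helper for illustration; assumes UTF-8 encoded files.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the subfolder if needed

    written = []
    # glob("*.txt") is not recursive, so files already in a subfolder are skipped;
    # sorted() only makes the processing order predictable
    for path in sorted(src.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # compare in lower case so that "The" is treated like "the"
        kept = [word for word in tokenize(text) if word.lower() not in stop_words]
        out_path = dst / path.name
        out_path.write_text(" ".join(kept), encoding="utf-8")
        written.append(out_path)
    return written


# Example call (assumes the NLTK 'stopwords' corpus has been downloaded):
#   from nltk.corpus import stopwords
#   clean_folder(".", "subfolder", set(stopwords.words("english")))
```

Each file is read and written once, which should be noticeably faster for a few thousand files than reopening the output file for every word.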

Upvotes: 2
