Loop through files and save them separately

Question

I want to loop through a local folder with a couple thousand text files, remove the stop words, and save the files in a subfolder. My code loops through all the files, but writes all the text into ONE new file. I need the files kept separate, as they were, with the exact same filenames, just without the stop words. What am I doing wrong?

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import glob
import os
import codecs

stop_words = set(stopwords.words('english'))

for afile in glob.glob("*.txt"):
    file1 = codecs.open(afile, encoding='utf-8')
    line = file1.read()
    words = word_tokenize(line)
    words_without_stop_words = [word for word in words if word not in stop_words]
    new_words = " ".join(words_without_stop_words).strip()
    appendFile = open('subfolder/file1.txt','w', encoding='utf-8')
    appendFile.write(new_words)
    appendFile.close()

I see that the filename will always be "file1" (line 11). I just can't get my head around glob (if glob is even the solution).

Andrey Bulezyuk · Accepted Answer

Quick Solution:

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import glob
import os
import codecs

stop_words = set(stopwords.words('english'))

for afile in glob.glob("*.txt"):
    # Read the whole input file, then close it again
    file1 = codecs.open(afile, encoding='utf-8')
    line = file1.read()
    file1.close()
    words = word_tokenize(line)
    words_without_stop_words = [word for word in words if word not in stop_words]
    new_words = " ".join(words_without_stop_words).strip()

    # getSubfolder() and getFilename() are placeholders you have to implement
    subfolder = getSubfolder(afile)
    filename = getFilename(afile)
    appendFile = open('{}/{}.txt'.format(subfolder,filename),'w', encoding='utf-8')
    appendFile.write(new_words)
    appendFile.close()

I've never worked with glob or codecs, but I believe your problem lies in the last three lines of your code. You use a constant string ('subfolder/file1.txt') as the target path, so every iteration opens the same file in 'w' mode and all your results land in one file. I replaced the target path with two variables, which I get from the functions "getSubfolder()" and "getFilename()". You have to implement these functions yourself to get the filename you need.
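One possible way to fill in the two placeholder functions is sketched below. This is only an assumption about what they could look like: the folder name "subfolder" and the choice to create it next to the input file are illustrative, and getFilename() returns the name without its extension because the open() call above appends ".txt" itself.

```python
import os

def getSubfolder(afile, name="subfolder"):
    """Return the output folder next to afile, creating it if needed.

    The default folder name "subfolder" is an assumption for illustration.
    """
    folder = os.path.join(os.path.dirname(afile), name)
    os.makedirs(folder, exist_ok=True)
    return folder

def getFilename(afile):
    """Return the file name without its directory and without its extension."""
    return os.path.splitext(os.path.basename(afile))[0]
```

With these definitions, an input file named notes.txt would be written to subfolder/notes.txt.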

If I understand your goal correctly, the filename stays the same and only the folder changes. In that case you can use this line (afile already ends in ".txt", so the extension is not added again):

    appendFile = open('{}/{}'.format('mysubfolder',afile),'w', encoding='utf-8')
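Putting the pieces together, a self-contained sketch of the whole loop could look like the following. The function name remove_stop_words, its parameters, and the whitespace tokenizer default (str.split) are assumptions made so the example runs without NLTK; they are not part of the original code.

```python
import glob
import os

def remove_stop_words(src_dir, stop_words, out_name="mysubfolder", tokenize=str.split):
    """Write a stop-word-free copy of every .txt file in src_dir to src_dir/out_name.

    Hypothetical helper: the name, parameters and the str.split default
    are illustrative stand-ins for the NLTK calls in the question.
    """
    out_dir = os.path.join(src_dir, out_name)
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for path in glob.glob(os.path.join(src_dir, "*.txt")):
        with open(path, encoding="utf-8") as infile:
            words = tokenize(infile.read())
        kept = [word for word in words if word not in stop_words]
        # Reuse the original file name so each input gets its own output file
        target = os.path.join(out_dir, os.path.basename(path))
        with open(target, "w", encoding="utf-8") as outfile:
            outfile.write(" ".join(kept))
        written.append(target)
    return written
```

To get the behaviour from the question, you would pass set(stopwords.words('english')) as stop_words and tokenize=word_tokenize. The with blocks close each file even if an error occurs halfway through the loop.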

Solution while learning:

I would recommend taking a look at https://github.com/inducer/pudb and stepping through every iteration of your loop. This way you will see what Python does, which value each variable has at a given point in time, and so on.
