TAN-C-F-OK

Reputation: 179

Loop through files and save them separately

I want to loop through a local folder with a couple thousand text files, remove the stop-words, and save the files in a sub-folder. My code loops through all files, but writes all text files into ONE new file. I need the files kept separate, as they were, with the exact same filenames, just without the stop-words. What am I doing wrong?

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import glob
import os
import codecs

stop_words = set(stopwords.words('english'))

for afile in glob.glob("*.txt"):
    file1 = codecs.open(afile, encoding='utf-8')
    line = file1.read()
    words = word_tokenize(line)
    words_without_stop_words = [word for word in words if word not in stop_words]
    new_words = " ".join(words_without_stop_words).strip()
    appendFile = open('subfolder/file1.txt','w', encoding='utf-8')
    appendFile.write(new_words)
    appendFile.close()

I see that the filename(s) will be "file1" (line 11) - I just can't get my head around glob (if glob is even the solution?).

Upvotes: 1

Views: 2357

Answers (2)

Andrey Bulezyuk

Reputation: 197

Quick Solution:

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import glob
import os
import codecs

stop_words = set(stopwords.words('english'))

for afile in glob.glob("*.txt"):
    file1 = codecs.open(afile, encoding='utf-8')
    line = file1.read()
    words = word_tokenize(line)
    words_without_stop_words = [word for word in words if word not in stop_words]
    new_words = " ".join(words_without_stop_words).strip()

    subfolder = getSubfolder(afile)
    filename = getFilename(afile)
    appendFile = open('{}/{}.txt'.format(subfolder,filename),'w', encoding='utf-8')
    appendFile.write(new_words)
    appendFile.close()

I've never worked with glob or codecs, but I believe your problem lies in the last three lines of your code. You use a constant string ('subfolder/file1.txt') as the target path, so every iteration writes to the same file; that's why your results land in one file. I replaced the target path with two variables, which I get from the functions "getSubfolder()" and "getFilename()". You have to implement these functions yourself to produce the path you need.
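As a rough illustration of what those two functions might look like, here is a minimal sketch. Both implementations are hypothetical: the names come from the snippet above, but the behavior (a fixed sub-folder next to the input file, and the file name with its extension stripped so that the '.txt' in the format string is not doubled) is an assumption about what is wanted.

```python
import os

def getSubfolder(path, subfolder_name="subfolder"):
    # Hypothetical helper: the output directory sits next to the input file.
    # "subfolder" is an assumed default name taken from the question.
    return os.path.join(os.path.dirname(path), subfolder_name)

def getFilename(path):
    # Hypothetical helper: file name without directory and without extension,
    # because the calling code appends '.txt' itself.
    return os.path.splitext(os.path.basename(path))[0]
```

With these definitions, 'report.txt' maps to the sub-folder 'subfolder' and the name 'report', so the format string in the loop yields 'subfolder/report.txt'. The sub-folder has to exist before the file is opened.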

If I understand your goal correctly, the filename stays the same and only the folder changes. In that case you can use this line (afile already ends in .txt, so the extension is not appended again):

    appendFile = open('{}/{}'.format('mysubfolder',afile),'w', encoding='utf-8')
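Putting the pieces together, the whole loop could look roughly like the sketch below. To keep it runnable without NLTK, it makes two assumptions that differ from the question: a tiny hard-coded stop-word set stands in for stopwords.words('english'), and whitespace splitting stands in for word_tokenize. The function names and the 'mysubfolder' default are illustrative only.

```python
import glob
import os

# Assumption: small stand-in set; the question uses nltk's English stop words.
STOP_WORDS = {"the", "a", "an", "is", "and", "of"}

def strip_stop_words(text, stop_words=STOP_WORDS):
    # Assumption: whitespace splitting replaces word_tokenize, and the
    # comparison is case-insensitive (the question's code is case-sensitive).
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

def process_folder(src_dir, out_name="mysubfolder"):
    out_dir = os.path.join(src_dir, out_name)
    os.makedirs(out_dir, exist_ok=True)  # create the sub-folder if missing
    written = []
    # "*.txt" does not recurse, so files already in the sub-folder are skipped.
    for path in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
        with open(path, encoding="utf-8") as src:
            cleaned = strip_stop_words(src.read())
        # Same file name as the input, just inside the sub-folder.
        target = os.path.join(out_dir, os.path.basename(path))
        with open(target, "w", encoding="utf-8") as dst:
            dst.write(cleaned)
        written.append(target)
    return written
```

Calling process_folder(".") would mirror the question's setup. The with blocks close each file even if an error occurs, which matters when looping over a few thousand files.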

Solution while learning:

I recommend taking a look at https://github.com/inducer/pudb and stepping through every iteration of your loop. This way you will see what Python does and what value each variable holds at any given point.

Upvotes: 1

Mr Alihoseiny

Reputation: 1229

The reason is that you use the same file name in every iteration of the loop. You should change the file name on each iteration. For example, you can try this:

counter = 0 # This line added
for afile in glob.glob("*.txt"):
    file1 = codecs.open(afile, encoding='utf-8')
    line = file1.read()
    words = word_tokenize(line)
    words_without_stop_words = [word for word in words if word not in stop_words]
    new_words = " ".join(words_without_stop_words).strip()
    appendFile = open('subfolder/file1' + str(counter) + ".txt",'w', encoding='utf-8') # This line changed
    appendFile.write(new_words)
    appendFile.close()
    counter += 1 # This line added

What happened here: we added a counter variable and append its value to the end of each file name.

At the end of each iteration we increment the counter, so every file gets a distinct name.

You can also try other naming schemes, such as including the original file name in the new file name.
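One possible variation, sketched below, keeps the original file name and appends the counter to it, with enumerate supplying the counter instead of a manually incremented variable. The helper name output_path, the 'subfolder' default, and the sample file names are made up for illustration.

```python
import os

def output_path(source_path, index, out_dir="subfolder"):
    # Hypothetical helper: original name + "_" + counter + original extension.
    stem, ext = os.path.splitext(os.path.basename(source_path))
    return os.path.join(out_dir, "{}_{}{}".format(stem, index, ext))

# Example input names (made up); in the loop these would come from glob.glob.
paths = ["notes.txt", "report.txt"]
targets = [output_path(p, i) for i, p in enumerate(paths)]
```

Here 'notes.txt' becomes 'notes_0.txt' and 'report.txt' becomes 'report_1.txt' inside the sub-folder. If the original names are already unique, the counter can be dropped and the plain base name used instead.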

Upvotes: 1
