john

Reputation: 23

Count word frequency in multiple files

I am trying to write code that counts word frequencies across a collection of about 10000 files, but instead of the overall frequency I only get the word counts of the last file, because each iteration overwrites the output of the previous one. My code so far is:

import csv
import glob
import re


def main():
    file_list = glob.glob(TARGET_FILES)
    for file in file_list:
        with open(file, 'r', encoding='UTF-8', errors='ignore') as f_in:
            doc = f_in.read()
        get_data(doc)


def get_data(doc):
    vdictionary = {}
    w = csv.writer(open("output1.csv", "w", newline=''))
    tokens = re.findall(r'\w+', doc)
    for token in tokens:
        if token not in vdictionary:
            vdictionary[token] = 1
        else:
            vdictionary[token] += 1
    for key, val in vdictionary.items():
        w.writerow([key, val])

Upvotes: 0

Views: 926

Answers (2)

Nathan

Reputation: 3648

I think the problem is that you empty the csv file on every iteration: opening it in "w" mode truncates it each time. As a quick check, see what happens if you open it in append mode,

w = csv.writer(open("output1.csv", "a",newline=''))

instead of

w = csv.writer(open("output1.csv", "w",newline=''))

I suspect you'll get a separate count for each file. If that is the case, you should make one dictionary, update it for each file, and write it to the csv file only once at the end.

You can get one dictionary like this:

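 import csv
 import glob
 import re


 def get_data(doc, vdictionary):
     # Add the word counts of one document to the shared dictionary.
     tokens = re.findall(r'\w+', doc)
     for token in tokens:
         if token not in vdictionary:
             vdictionary[token] = 1
         else:
             vdictionary[token] += 1
     return vdictionary


 def main():
     files = glob.glob(TARGET_FILES)  # same pattern as in the question
     vdictionary = {}
     for file in files:
         # get_data expects the file's text, not its path, so read it first.
         with open(file, 'r', encoding='UTF-8', errors='ignore') as f_in:
             vdictionary = get_data(f_in.read(), vdictionary)
     # Write the combined counts once, after all files have been processed.
     with open("output1.csv", "w", newline='') as f_out:
         w = csv.writer(f_out)
         for key, val in vdictionary.items():
             w.writerow([key, val])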

Upvotes: 1

William

Reputation: 81

I think your issue is that every time you call get_data, you rewrite the csv with only the counts from that file. Instead, you could create one dictionary up front, accumulate the counts of each word across all files in it, and only then write the rows with w.writerow([key, val]).

Essentially, do not output to the csv every time you go through a file. Go through all the files, updating one master dictionary, then output to a csv.
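
As a rough sketch of that idea, here is one way it could look using collections.Counter from the standard library in place of a hand-rolled dictionary. The function names count_words and write_counts are made up for illustration, and sorting the output by frequency is my own choice, not something the question asks for.

```python
import csv
import re
from collections import Counter


def count_words(paths):
    """Return one Counter with word counts summed over every file in paths."""
    totals = Counter()
    for path in paths:
        with open(path, "r", encoding="UTF-8", errors="ignore") as f_in:
            # Counter.update adds the counts of the new tokens to the totals.
            totals.update(re.findall(r"\w+", f_in.read()))
    return totals


def write_counts(totals, out_path):
    """Write the combined counts to a CSV file once, most frequent word first."""
    with open(out_path, "w", newline="", encoding="UTF-8") as f_out:
        writer = csv.writer(f_out)
        # most_common() yields (word, count) pairs sorted by descending count.
        writer.writerows(totals.most_common())
```

With the question's setup you would call it roughly as write_counts(count_words(glob.glob(TARGET_FILES)), "output1.csv"), so the csv is opened exactly once after all files have been counted.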

Upvotes: 1
