Alph

Reputation: 391

Creating a dictionary of word count of multiple text files in a directory

I am using the build_dict() function inside the word_count_directory() function to create a dictionary of word counts for three files in a directory. I want to create three dictionaries, one at a time for each file, and update the previous dictionary with each new one. My code creates a single dictionary (wordcount) that combines all three dictionaries at the same time. How can I accomplish this?

def build_dict(filename):
   f = open(filename, 'rU')
   words = f.read().split()
   count = {}

   for word in words:
      word = word.lower()
      if word not in count:
        count[word] = 1
      else:
        count[word] += 1

   f.close()

   return count
## print build_dict("C:\\Users\\Phil2040\\Desktop\\word_count\\news1.txt")

import os
import os.path
def word_count_directory(directory):
    wordcount={}
    filelist=[os.path.join(directory,f) for f in os.listdir(directory)]
    for file in filelist:
       wordcount=build_dict(file)  # calling build_dict function
    return wordcount
print word_count_directory("C:\\Users\\Phil2040\\Desktop\\Word_count")

Upvotes: 1

Views: 3327

Answers (2)

André Laszlo

Reputation: 15537

Use collections.Counter.

Example files:

/tmp/foo.txt

hello world
hello world
foo bar
foo bar baz

/tmp/bar.txt

hello world
hello world
foo bar
foo bar baz
foo foo foo

You can create one Counter per file, then add them together!

from collections import Counter

def word_count(filename):
    with open(filename, 'r') as f:
        c = Counter()
        for line in f:
            # split() with no argument skips blank lines and repeated spaces
            c.update(line.split())
        return c

files = ['/tmp/foo.txt', '/tmp/bar.txt']
counters = [word_count(filename) for filename in files]

# counters content (example):
# [Counter({'world': 2, 'foo': 2, 'bar': 2, 'hello': 2, 'baz': 1}),
#  Counter({'foo': 5, 'world': 2, 'bar': 2, 'hello': 2, 'baz': 1})]

# Add all the word counts together:
total = sum(counters, Counter())  # sum needs an empty counter to start with

# total content (example):
# Counter({'foo': 7, 'world': 4, 'bar': 4, 'hello': 4, 'baz': 2})
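To apply the same idea to a whole directory while keeping one Counter per file and updating a running total after each file, something like the following might work. It is only a sketch: the names count_words and word_count_directory, the lowercasing, and the choice to skip subdirectories are my assumptions, not part of the question.

```python
import os
from collections import Counter

def count_words(path):
    """Count lowercased, whitespace-separated words in one file (hypothetical helper)."""
    counts = Counter()
    with open(path, "r") as f:
        for line in f:
            counts.update(line.lower().split())
    return counts

def word_count_directory(directory):
    """Return (per_file, total): one Counter per file name, plus their sum."""
    per_file = {}
    total = Counter()
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # assumption: subdirectories are ignored
        per_file[name] = count_words(path)
        total.update(per_file[name])  # add this file's counts to the running total
    return per_file, total
```

Calling word_count_directory on the directory path returns both results, so the individual dictionaries stay available after the total has been built.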

Upvotes: 3

volent

Reputation: 471

import os

def word_count_directory(directory):
    filelist = [os.path.join(directory, f) for f in os.listdir(directory)]
    return [build_dict(file) for file in filelist]

This returns a list of dictionaries, one for each file.

If you want the word count of each file one after the other, you can use yield to turn the function into a generator:

def word_count_directory(directory):
    filelist = [os.path.join(directory, f) for f in os.listdir(directory)]
    for file in filelist:
        yield build_dict(file)

counts = word_count_directory(".")  # create the generator once
first = next(counts)   # word count of the first file
second = next(counts)  # word count of the second file

For your first function (build_dict), you should take a look at the Counter class from the collections module.
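As a rough sketch of what that could look like, build_dict can be reduced to a few lines with Counter. This assumes Python 3 and keeps the lowercasing from the question; a Counter is a dict subclass, so the rest of the code can keep treating the result as a dictionary.

```python
from collections import Counter

def build_dict(filename):
    """Return a Counter mapping each lowercased word in the file to its count."""
    with open(filename, "r") as f:
        # split() with no argument splits on any run of whitespace
        return Counter(f.read().lower().split())
```

One difference from a plain dict: looking up a word that never appeared returns 0 instead of raising KeyError.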

Upvotes: 1
