Reputation:
I am writing code that lets the user supply a .txt file of their choice. So, for example, if the text read:
"I am you. You ArE I."
I would like my code to create a dictionary that resembles this:
{I: 2, am: 1, you: 2, are: 1}
The words in the file should be the keys, and the number of times each word appears should be the values. Capitalization should be irrelevant, so are = ARE = ArE = arE, etc.
This is my code so far. Any suggestions or help?
>> file = input("\n Please select a file")
>> name = open(file, 'r')
>> dictionary = {}
>> with name:
>>     for line in name:
>>         (key, val) = line.split()
>>         dictionary[int(key)] = val
Upvotes: 2
Views: 115
Reputation: 76695
Take a look at the examples in this answer:
Python : List of dict, if exists increment a dict value, if not append a new dict
You can use collections.Counter() to trivially do what you want, but if for some reason you can't use that, you can use a defaultdict or even a simple loop to build the dictionary you want.
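In case Counter is not an option, here is a minimal sketch of the two alternatives mentioned above. The function names count_with_defaultdict and count_with_plain_dict and the sample word list are made up for illustration; they are not from the linked answer.

```python
from collections import defaultdict


def count_with_defaultdict(words):
    """Count words case-insensitively using defaultdict(int)."""
    counts = defaultdict(int)  # missing keys start at 0
    for word in words:
        counts[word.lower()] += 1
    return dict(counts)


def count_with_plain_dict(words):
    """Same counting with an ordinary dict and dict.get()."""
    counts = {}
    for word in words:
        key = word.lower()
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical sample: the question's text, already split into words.
sample = ["I", "am", "you", "You", "ArE", "I"]
print(count_with_defaultdict(sample))
print(count_with_plain_dict(sample))
```

Both functions should give the same counts; the defaultdict version just saves you the explicit "is the key there yet" handling.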
Here is code that solves your problem. This will work in Python 3.1 and newer.
from collections import Counter
import string

def filter_punctuation(s):
    return ''.join(ch if ch not in string.punctuation else ' ' for ch in s)

def lower_case_words(f):
    for line in f:
        line = filter_punctuation(line)
        for word in line.split():
            yield word.lower()

def count_key(tup):
    """
    key function to make a count dictionary sort into descending order
    by count, then case-insensitive word order when counts are the same.
    tup must be a tuple in the form: (word, count)
    """
    word, count = tup
    return (-count, word.lower())

dictionary = {}
fname = input("\nPlease enter a file name: ")
with open(fname, "rt") as f:
    dictionary = Counter(lower_case_words(f))

print(sorted(dictionary.items(), key=count_key))
From your example I could see that you wanted punctuation stripped away. Since we are going to split the string on whitespace, I wrote a function that turns punctuation into whitespace. That way, if you have a string like hello,world this will be split into the words hello and world when we split on whitespace.
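To see why punctuation is replaced with a space rather than simply deleted, here is a small comparison. delete_punctuation is a hypothetical variant written only for this contrast; it is not part of the solution.

```python
import string


def filter_punctuation(s):
    """Replace every punctuation character with a space (as in the answer)."""
    return ''.join(ch if ch not in string.punctuation else ' ' for ch in s)


def delete_punctuation(s):
    """Hypothetical alternative: drop punctuation without leaving a space."""
    return ''.join(ch for ch in s if ch not in string.punctuation)


text = "hello,world"
print(filter_punctuation(text).split())  # ['hello', 'world']
print(delete_punctuation(text).split())  # ['helloworld']
```

One side effect to be aware of: the apostrophe is in string.punctuation too, so a contraction like don't comes out as the two words don and t. If that matters for your text, you would need to treat apostrophes separately.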
The function lower_case_words() is a generator: it reads an input file one line at a time and then yields one word at a time from each line. This neatly puts our input processing into a tidy "black box", and later we can simply call Counter(lower_case_words(f)) and it does the right thing for us.
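If you want to try the generator without creating a file first, something like the following should work. It uses io.StringIO as a stand-in for an open file (that substitution is my assumption for the demo; the real code reads from open(fname)), and fake_file is just an illustrative name.

```python
import io
import string
from collections import Counter


def filter_punctuation(s):
    return ''.join(ch if ch not in string.punctuation else ' ' for ch in s)


def lower_case_words(f):
    # Works on any iterable of lines, not only a real file object.
    for line in f:
        line = filter_punctuation(line)
        for word in line.split():
            yield word.lower()


# Stand-in for a file containing the text from the question.
fake_file = io.StringIO("I am you. You ArE I.\n")
words = lower_case_words(fake_file)

print(next(words))     # first word only; nothing else has been read yet
print(Counter(words))  # Counter consumes the remaining words
```

Note that the second print counts only the words left after the first next() call, which shows that the generator produces words on demand instead of building a full list up front.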
Of course you don't have to print the dictionary sorted, but I think it looks better this way. I made the sort order put the highest counts first, and where counts are equal, put the words in alphabetical order.
With your suggested input, this is the resulting output:
[('i', 2), ('you', 2), ('am', 1), ('are', 1)]
Because of the sorting it always prints in the above order.
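To illustrate how the sort key produces that order, here is a small self-contained example. The counts dictionary is hand-written for the demo and deliberately listed out of order.

```python
def count_key(tup):
    """Sort by descending count, then alphabetically by word."""
    word, count = tup
    # Negating the count makes larger counts sort first.
    return (-count, word.lower())


# Hand-written sample, deliberately not in the final order.
counts = {"you": 2, "are": 1, "i": 2, "am": 1}

print(sorted(counts.items(), key=count_key))
# [('i', 2), ('you', 2), ('am', 1), ('are', 1)]
```

Returning a tuple from the key function is what gives the two-level ordering: tuples compare element by element, so the word is only consulted when the counts are equal.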
Upvotes: 1