Reputation: 23
I'm new to Python and I want to count the number of times each word occurs across all the files. I need to display each word, the number of times it occurred, and the percentage of the total it accounts for. The list should be sorted so the most frequent word appears first and the least frequent word appears last. I'm working on a small sample right now, just one file, but I can't get it to work right:
from collections import defaultdict
words = "apple banana apple strawberry banana lemon"
d = defaultdict(int)
for word in words.split():
    d[word] += 1
Upvotes: 2
Views: 1502
Reputation: 662
As recommended above, the Counter class from the collections module is definitely the way to go for counting applications.
This solution also addresses the request to count words in multiple files, using fileinput.input() to iterate over the contents of all the filenames specified on the command line (or, if no filenames are specified, to read from STDIN, typically the keyboard).
Finally, it uses a slightly more sophisticated approach to breaking each line into 'words', with a regular expression as the delimiter. As noted in the code, it handles contractions more gracefully (however, it will be confused by apostrophes used as single quotes).
"""countwords.py
count all words across all files
"""
import fileinput
import re
import collections
# create a regex delimiter that matches one or more characters that
# are neither word characters nor apostrophes; this allows contractions
# to be treated as a word (eg can't won't didn't)
# Caution: this WILL get confused by a line that uses apostrophe
# as a single quote: eg 'hello' would be treated as a 7 letter word
word_delimiter = re.compile(r"[^\w']+")
# create an empty Counter
counter = collections.Counter()
# use fileinput.input() to open and read ALL lines from ALL files
# specified on the command line, or if no files specified on the
# command line then read from STDIN (ie the keyboard or redirect)
for line in fileinput.input():
    for word in word_delimiter.split(line):
        counter[word.lower()] += 1  # count case insensitively
del counter['']  # handle corner case of the occasional 'empty' word
# compute the total number of words using .values() to get a
# view of all the Counter values (ie the individual word counts)
# then pass that view to the sum function which is able to
# work with any iterable
total = sum(counter.values())
# iterate through the key/value pairs (ie word/word_count) in sorted
# order - the lambda function says sort based on position 1 of each
# word/word_count tuple (ie the word_count) and reverse=True does
# exactly what it says = reverse the normal order so it now goes
# from highest word_count to lowest word_count
print("{:>10s} {:>8s} {:s}".format("occurs", "percent", "word"))
for word, count in sorted(counter.items(),
                          key=lambda t: t[1],
                          reverse=True):
    print("{:10d} {:8.2f}% {:s}".format(count, count/total*100, word))
Example output:
$ python3 countwords.py
I have a dog, he is a good dog, but he can't fly
^D
occurs percent word
2 15.38% a
2 15.38% dog
2 15.38% he
1 7.69% i
1 7.69% have
1 7.69% is
1 7.69% good
1 7.69% but
1 7.69% can't
1 7.69% fly
And:
$ python3 countwords.py text1 text2
occurs percent word
2 11.11% hello
2 11.11% i
1 5.56% there
1 5.56% how
1 5.56% are
1 5.56% you
1 5.56% am
1 5.56% fine
1 5.56% mark
1 5.56% where
1 5.56% is
1 5.56% the
1 5.56% dog
1 5.56% haven't
1 5.56% seen
1 5.56% him
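To see what the delimiter regex in the script does on its own, here is a minimal sketch. The helper name split_words is hypothetical (it is not part of the script above); it just applies the same pattern to one line, lowercases the pieces, and drops empty strings.

```python
import re

# Same delimiter as in the script above: one or more characters that are
# neither word characters nor apostrophes.
word_delimiter = re.compile(r"[^\w']+")


def split_words(line):
    """Hypothetical helper: lowercase the pieces and drop empty strings."""
    return [w.lower() for w in word_delimiter.split(line) if w]


# Contractions stay whole.
print(split_words("I can't fly, he won't swim."))
# ['i', "can't", 'fly', 'he', "won't", 'swim']

# Apostrophes used as quotes stick to the word (the caveat noted above).
print(split_words("She said 'hello' twice"))
# ['she', 'said', "'hello'", 'twice']
```

Filtering empty strings inside the helper is an alternative to the del counter[''] step used in the script; either should give the same counts.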
Upvotes: 2
Reputation: 12018
Using your code, here's a neater approach:
import operator
import sys

# Initializing dictionary
d = {}
with open(sys.argv[1], 'r') as f:
    # Count the number of times each word comes up (in dictionary)
    for line in f:
        words = line.lower().split()
        # Iterate over each word in line
        for word in words:
            if word not in d:
                d[word] = 1
            else:
                d[word] += 1
n_all_words = sum(d.values())
# Sort a dictionary using this useful solution
# https://stackoverflow.com/a/613218/10521959
# (reverse=True puts the most frequent word first)
sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True)
# Print percentage occurrence
for k, v in sorted_d:
    print(f'{k} occurs {v} times and is {(100*v/n_all_words):,.2f}% of all words.')
Upvotes: 1
Reputation: 4618
The most straightforward way to do this is to use the Counter class:
from collections import Counter
c = Counter(words.split())
output:
Counter({'apple': 2, 'banana': 2, 'strawberry': 1, 'lemon': 1})
to get just the words, or just the counts (on Python 3.7+ both come back in first-seen order, not sorted by frequency):
list(c.keys())
list(c.values())
or put it into a normal dict:
dict(c.items())
or get a list of (word, count) tuples, most frequent first:
c.most_common()
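The question also asks for percentages. A minimal sketch of one way to get them from the Counter, reusing the sample string from the question (the names total and rows are just for illustration):

```python
from collections import Counter

words = "apple banana apple strawberry banana lemon"
c = Counter(words.split())
total = sum(c.values())  # total number of words counted

# most_common() yields (word, count) pairs, highest count first
rows = [(word, count, 100 * count / total) for word, count in c.most_common()]
for word, count, percent in rows:
    print(f"{word:<12}{count:>3}{percent:>8.2f}%")
```

With this sample, apple and banana should each come out at about 33.33%, and strawberry and lemon at about 16.67%.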
Upvotes: 0
Reputation: 4427
As mentioned in the comments, this is precisely what collections.Counter is for:
from collections import Counter
words = 'a b c a'.split()
print(Counter(words).most_common())
From the docs: https://docs.python.org/2/library/collections.html
most_common([n])
Return a list of the n most common elements and their counts
from the most common to the least. If n is omitted or None,
most_common() returns all elements in the counter.
Elements with equal counts are ordered arbitrarily:
>>> Counter('abracadabra').most_common(3)
[('a', 5), ('r', 2), ('b', 2)]
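Note that the quoted text is from the Python 2 docs. On Python 3.7 and later, Counter remembers insertion order and elements with equal counts come back in the order first encountered, so the same call gives a slightly different result. A quick check, assuming Python 3.7+:

```python
from collections import Counter

# Python 3.7+: ties ('b' and 'r' both occur twice) keep first-seen order
top3 = Counter('abracadabra').most_common(3)
print(top3)  # [('a', 5), ('b', 2), ('r', 2)]
```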
Upvotes: 1