nubby

Reputation: 23

How do you count occurrences in a list in Python?

I'm new to Python and I want to count the number of times each word occurs across all the files, then display each word, the number of times it occurred, and the percentage of the time it occurred. The list should be sorted so the most frequent word appears first and the least frequent word appears last. I'm working on a small sample right now, just one file, but I can't get it to work right.

from collections import defaultdict

words = "apple banana apple strawberry banana lemon"

d = defaultdict(int)
for word in words.split():
    d[word] += 1

Upvotes: 2

Views: 1502

Answers (4)

quizdog

Reputation: 662

As recommended above, the Counter class from the collections module is definitely the way to go for counting applications.

This solution also addresses the request to count words in multiple files: it uses fileinput.input() to iterate over the contents of all the filenames specified on the command line (or, if no filenames are specified, it reads from STDIN, typically the keyboard).

Finally, it uses a slightly more sophisticated approach to breaking each line into 'words', with a regular expression as the delimiter. As noted in the code, it handles contractions more gracefully (however, it will be confused by apostrophes used as single quotes).

"""countwords.py
   count all words across all files
"""

import fileinput
import re
import collections

# create a regex delimiter that matches one or more characters that
# are neither a word character nor an apostrophe; this allows
# contractions to be treated as a word (eg can't  won't  didn't )
# Caution: this WILL get confused by a line that uses apostrophe
# as a single quote: eg 'hello' would be treated as a 7 letter word

word_delimiter = re.compile(r"[^\w']+")

# create an empty Counter

counter = collections.Counter()

# use fileinput.input() to open and read ALL lines from ALL files
# specified on the command line, or if no files specified on the
# command line then read from STDIN (ie the keyboard or redirect)

for line in fileinput.input():
    for word in word_delimiter.split(line):
        counter[word.lower()] += 1   # count case insensitively

del counter['']   # handle corner case of the occasional 'empty' word

# compute the total number of words using .values() to get a
# view of all the Counter values (ie the individual word counts)
# then pass that view to the sum function, which works with
# any iterable (a list, a generator, a dict view, ...)

total = sum(counter.values())

# iterate through the key/value pairs (ie word/word_count) in sorted
# order - the lambda function says sort based on position 1 of each
# word/word_count tuple (ie the word_count) and reverse=True does
# exactly what it says = reverse the normal order so it now goes
# from highest word_count to lowest word_count

print("{:>10s}  {:>8s} {:s}".format("occurs", "percent", "word"))

for word, count in sorted(counter.items(),
                          key=lambda t: t[1],
                          reverse=True):
    print("{:10d} {:8.2f}% {:s}".format(count, count/total*100, word))

Example output:

$ python3 countwords.py
I have a dog, he is a good dog, but he can't fly
^D

    occurs   percent word
         2    15.38% a
         2    15.38% dog
         2    15.38% he
         1     7.69% i
         1     7.69% have
         1     7.69% is
         1     7.69% good
         1     7.69% but
         1     7.69% can't
         1     7.69% fly

And:

$ python3 countwords.py text1 text2
    occurs   percent word
         2    11.11% hello
         2    11.11% i
         1     5.56% there
         1     5.56% how
         1     5.56% are
         1     5.56% you
         1     5.56% am
         1     5.56% fine
         1     5.56% mark
         1     5.56% where
         1     5.56% is
         1     5.56% the
         1     5.56% dog
         1     5.56% haven't
         1     5.56% seen
         1     5.56% him
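
To see the delimiter's behaviour in isolation, here is a minimal sketch that applies the same pattern to two sample lines. The split_words helper and the sample sentences are made up for illustration and are not part of the script above; the second line shows the single-quote caveat mentioned in the comments.

```python
import re

# Same pattern as in the script above: split on runs of characters that
# are neither word characters nor apostrophes.
word_delimiter = re.compile(r"[^\w']+")

def split_words(line):
    """Hypothetical helper: lowercase the line, split it, drop empty strings."""
    return [w for w in word_delimiter.split(line.lower()) if w]

# Contractions survive as single words
print(split_words("I can't fly, he said."))
# -> ['i', "can't", 'fly', 'he', 'said']

# Apostrophes used as quotes stay attached to the word
print(split_words("She said 'hello' twice"))
# -> ['she', 'said', "'hello'", 'twice']
```

Filtering out the empty strings while splitting is an alternative to deleting the '' key from the counter afterwards; either works.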

Upvotes: 2

Yaakov Bressler

Reputation: 12018

Using your code, here's a neater approach:

import operator
import sys

# Initializing Dictionary
d = {}
with open(sys.argv[1], 'r') as f:

    # counting number of times each word comes up in list of words (in dictionary)
    for line in f:
        words = line.lower().split()
        # Iterate over each word in line
        for word in words:
            if word not in d:
                d[word] = 1
            else:
                d[word] += 1

# Total number of words counted (sum of the per-word counts)
n_all_words = sum(d.values())

# Sort a dictionary using this useful solution
# https://stackoverflow.com/a/613218/10521959
# reverse=True puts the most frequent word first
sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True)

# Print percentage occurrence, most frequent word first
for k, v in sorted_d:
    print(f'{k} occurs {v} times and is {(100*v/n_all_words):,.2f}% total of words.')

Upvotes: 1

Derek Eden

Reputation: 4618

The most straightforward way to do this is to use the Counter class:

from collections import Counter

# words is the string from the question
words = "apple banana apple strawberry banana lemon"
c = Counter(words.split())

output:

Counter({'apple': 2, 'banana': 2, 'strawberry': 1, 'lemon': 1})

to get just the words (in the order they were first seen), or just the counts:

list(c.keys())
list(c.values())

or put it into a normal dict:

dict(c.items())

or a list of (word, count) tuples, sorted from most to least common:

c.most_common()
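
Since the question also asks for percentages, here is a minimal sketch that builds on most_common(). The word_report helper name and the column widths are my own choices for illustration, not anything required by Counter.

```python
from collections import Counter

words = "apple banana apple strawberry banana lemon"

def word_report(text):
    """Hypothetical helper: return (word, count, percent) rows, most frequent first."""
    counts = Counter(text.split())
    total = sum(counts.values())
    # most_common() with no argument returns every word, highest count first
    return [(word, count, 100 * count / total)
            for word, count in counts.most_common()]

for word, count, percent in word_report(words):
    print(f"{word:<12}{count:>3}{percent:>8.2f}%")
```

With the sample string, apple and banana each account for 2 of the 6 words (33.33%), and strawberry and lemon for 1 each (16.67%).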

Upvotes: 0

Cireo

Reputation: 4427

As mentioned in the comments, this is precisely what collections.Counter is for:

from collections import Counter

words = 'a b c a'.split()
print(Counter(words).most_common())

From docs: https://docs.python.org/2/library/collections.html

most_common([n])
Return a list of the n most common elements and their counts
from the most common to the least. If n is omitted or None,
most_common() returns all elements in the counter.
Elements with equal counts are ordered arbitrarily:

>>> Counter('abracadabra').most_common(3)
[('a', 5), ('r', 2), ('b', 2)]
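
Note that the quote above is from the Python 2 docs. On Python 3.7 and later, elements with equal counts come back in the order they were first encountered. If you want tie order that does not depend on input order at all, one option is to sort the items yourself. The sketch below assumes you want ties broken alphabetically; the ranked helper is a made-up name.

```python
from collections import Counter

def ranked(counter):
    """Hypothetical helper: sort by descending count, then alphabetically."""
    # Negating the count sorts high-to-low while the word sorts a-to-z
    return sorted(counter.items(), key=lambda item: (-item[1], item[0]))

print(ranked(Counter('abracadabra'))[:3])
# -> [('a', 5), ('b', 2), ('r', 2)]
```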

Upvotes: 1
