Reputation: 1

Counting how many times each word occurs in a text

I want this code to read a text file and print how often each word occurs, as a percentage of the total word count. It almost works.

I can't figure out how to sort the printout from highest occurrence to lowest. (I had it working at one point when I was copy/pasting other people's code. I think I imported collections and Counter, but I'm not sure.)

The other problem is that it loops over my whole word list. That's fine with smaller text files, but larger ones flood my terminal. I'd like it to print each word only once, instead of once for every occurrence.

name = raw_input('Enter file:')
handle = open(name, 'r')
text = handle.read()
words = text.split()

def percent(part, whole):
   return 100 * float(part)/float(whole)

total = len(words)

counts = dict()
for word in words:
    counts[word] = counts.get(word,0) + 1

print "\n"
print"Total Words\n", total
print"\n"

for word in words:
  print word, percent(counts[word],total),"%"

Upvotes: 0

Answers (3)

akn320

Reputation: 583

You can iterate through your dictionary like this:

for word in counts:
    print word, counts[word]

This will print each key in the dictionary once. For sorting, you should look at the built-in sorted() function: https://docs.python.org/3.4/library/functions.html#sorted
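
As a rough sketch of how sorted() could be applied here (written with print as a function so it runs on Python 3; the sample counts dict is made up for illustration):

```python
# Hypothetical sample data standing in for the asker's counts dict.
counts = {"the": 3, "cat": 1, "sat": 2}

# Sort the (word, count) pairs by count, largest first.
by_count = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

for word, count in by_count:
    print("%s %d" % (word, count))
```

The key function picks the count out of each pair, and reverse=True puts the most frequent word first.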

Upvotes: 1

xgord

Reputation: 4806

Your code is very close to being workable; I just see a few issues that are causing your problems:

P1: Your code doesn't take non-word characters into account. For example, word;, word., and word would all be treated as unique words.

text = handle.read()
words = text.split()
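
A small illustration of the effect, using a made-up input string (print as a function, so this runs on Python 3):

```python
# Hypothetical sample input: the same word with different punctuation attached.
text = "word; word. word"

# split() breaks on whitespace only, so punctuation stays glued to each token.
tokens = text.split()
distinct = set(tokens)

print(tokens)
print(len(distinct))
```

All three tokens end up as separate dictionary keys, so the count for word is split three ways.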

P2: You iterate over the entire list of words, which includes duplicates, instead of your unique list in counts. So of course you'll be printing each word multiple times.

for word in words:

P3: You open the file but never close it. Not exactly a problem with your code, but something to be improved. This is why using with open(...): syntax is generally encouraged, as it handles closing the file for you.

handle = open(name, 'r')

Here's your code with some fixes:

#!/usr/bin/python

import re

name = raw_input('Enter file:')

def percent(part, whole):
   return 100 * float(part)/float(whole)

# better way to open files, handles closing the file
with open(name, 'r') as handle:
    text = handle.read()

words = text.split()

# get rid of non-word characters that are messing up count
formatted = []
for w in words:
    formatted.extend(re.findall(r'\w+', w))

total = len(formatted)

counts = dict()
for word in formatted:
    counts[word] = counts.get(word,0) + 1

print "\n"
print"Total Words\n", total
print"\n"

# iterate over the counts dict instead of the original word list
# this way each word is only printed once
for word,count in counts.iteritems():
    print word, percent(counts[word],total),"%"

Output when the program is run on its own source file:

Total Words
79


text 2.53164556962 %
float 2.53164556962 %
as 1.26582278481 %
file 1.26582278481 %
in 3.79746835443 %
handle 2.53164556962 %
counts 6.32911392405 %
total 3.79746835443 %
open 1.26582278481 %
findall 1.26582278481 %
for 3.79746835443 %
0 1.26582278481 %
percent 2.53164556962 %
formatted 5.06329113924 %
1 1.26582278481 %
re 2.53164556962 %
dict 1.26582278481 %
usr 1.26582278481 %
Words 1.26582278481 %
print 5.06329113924 %
import 1.26582278481 %
split 1.26582278481 %
bin 1.26582278481 %
return 1.26582278481 %
extend 1.26582278481 %
get 1.26582278481 %
python 1.26582278481 %
len 1.26582278481 %
iteritems 1.26582278481 %
part 2.53164556962 %
words 2.53164556962 %
Enter 1.26582278481 %
100 1.26582278481 %
with 1.26582278481 %
count 1.26582278481 %
word 7.59493670886 %
name 2.53164556962 %
read 1.26582278481 %
raw_input 1.26582278481 %
n 3.79746835443 %
r 1.26582278481 %
w 3.79746835443 %
Total 1.26582278481 %
whole 2.53164556962 %
def 1.26582278481 %

Edit -- added explanation of the word formatting

Breakdown of formatted.extend(re.findall(r'\w+', w)):

1: the extend function of lists takes a list and appends its elements to the given list. For example:

listA = [1,2,3]
listB = [4,5,6]
listA.extend(listB)
print(listA)
# [1, 2, 3, 4, 5, 6]

2: re.findall(r'\w+', w)

This expression uses regular expressions to extract only the part of the string we care about. Here is a tutorial on python's regular expressions.

Basically, re.findall(x, y) returns a list of all the non-overlapping substrings in y that match the regular expression pattern given in x. In our case, \w matches any word character (i.e. letters, digits, and the underscore), and the + means one or more of the preceding pattern. So put together, \w+ means a run of one or more word characters.

I probably made it somewhat confusing by naming the string variable we're doing the search on w, but just keep in mind that the \w in the pattern is not related to the w variable that is the string.

import re

word = 'heres some1; called s0mething!'
re.findall(r'\w+', word)
# ['heres', 'some1', 'called', 's0mething']
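
Because whitespace never matches \w, the split-then-extend loop should give the same list as a single findall over the whole text. A minimal sketch to check that, using a made-up sample string (Python 3 syntax):

```python
import re

# Hypothetical sample text, including a newline between words.
text = "heres some1; called\ns0mething!"

# The approach from the fixed program: split on whitespace, then extract
# the word-character runs from each token.
formatted = []
for w in text.split():
    formatted.extend(re.findall(r'\w+', w))

# Alternative: run the same pattern once over the whole text.
direct = re.findall(r'\w+', text)

print(formatted == direct)
```

Either form works; the single call is just shorter.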

Upvotes: 0

Marque Boseman

Reputation: 26

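For your first problem, you can sort the items by count and store them in an OrderedDict from the collections module. Pass reverse=True so the highest counts come first:

import collections
sortedCounts = collections.OrderedDict(sorted(counts.items(), key=lambda t: t[1], reverse=True))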

To only print each word once:

for key, value in sortedCounts.iteritems():
  print key, percent(value,total),"%"
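
Since the question mentions Counter, here is a possible alternative sketch using collections.Counter (available from Python 2.7 on; written with print as a function so it runs on Python 3). The word list is made-up sample data, and the two-decimal formatting is my own choice:

```python
from collections import Counter

# Hypothetical sample data standing in for the cleaned-up word list.
formatted = ["a", "b", "a", "c", "a", "b"]
total = len(formatted)

# Counter does the tallying; most_common() yields (word, count) pairs
# ordered from the highest count to the lowest.
counts = Counter(formatted)

lines = []
for word, count in counts.most_common():
    lines.append("%s %.2f %%" % (word, 100.0 * count / total))

print("\n".join(lines))
```

This covers both the counting loop and the sort, so no separate OrderedDict is needed.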

Hope that helps

Upvotes: 0
