Reputation: 1
What I want this code to do is read a text file and print how often each word occurs, as a percentage. It almost works.
I can't figure out how to sort the printout from highest occurrence to lowest. (I had it working at one point when I was copy/pasting other people's code; I think I imported collections and Counter, but I'm not sure.)
Another problem is that it prints a line for every word in my list, which is fine with smaller text files, but larger ones flood my terminal. I'd like it to print each word only once, instead of once for each instance.
name = raw_input('Enter file:')
handle = open(name, 'r')
text = handle.read()
words = text.split()
def percent(part, whole):
    return 100 * float(part)/float(whole)
total = len(words)
counts = dict()
for word in words:
    counts[word] = counts.get(word,0) + 1
print "\n"
print"Total Words\n", total
print"\n"
for word in words:
    print word, percent(counts[word],total),"%"
Upvotes: 0
Views: 781
Reputation: 583
You can iterate through your dictionary like this:
for word in counts:
    print word, counts[word]
This will print each key in the dictionary once.
For sorting, you should look at the built-in sorted() function: https://docs.python.org/3.4/library/functions.html#sorted
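As a rough sketch of how sorted() could be applied here (written in Python 3, whereas the question uses Python 2; the sample counts dictionary is made up for illustration), sorting the (word, count) pairs by count in descending order might look like this:

```python
# Hypothetical sample data standing in for the counts dict from the question.
counts = {"the": 3, "cat": 1, "sat": 2}

# sorted() returns a new list; key picks the count out of each (word, count)
# pair, and reverse=True puts the highest count first.
by_count = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

for word, count in by_count:
    print(word, count)
```

The dictionary itself is left unchanged; only the returned list is ordered.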
Upvotes: 1
Reputation: 4806
Your code is very close to being workable; I just see a few issues that are causing your problems:
P1: Your code doesn't take non-word characters into account. For example, "word;", "word.", and "word" would all be treated as unique words.
text = handle.read()
words = text.split()
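To illustrate P1, here is a small sketch (Python 3, with a made-up sample string) comparing plain split() with a regular-expression tokenizer:

```python
import re

sample = "word; word. word"  # made-up input for illustration

# split() only breaks on whitespace, so punctuation stays attached.
print(sample.split())              # ['word;', 'word.', 'word']

# \w+ keeps only runs of word characters, dropping the punctuation.
print(re.findall(r"\w+", sample))  # ['word', 'word', 'word']
```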
P2: You iterate over the entire list of words, which includes duplicates, instead of the unique keys in counts. So of course you'll be printing each word multiple times.
for word in words:
P3: You open the file but never close it. Not exactly a problem with your code, but something to be improved. This is why using the with open(...): syntax is generally encouraged, as it handles closing the file for you.
handle = open(name, 'r')
Here's your code with some fixes:
#!/usr/bin/python
import re
name = raw_input('Enter file:')
def percent(part, whole):
    return 100 * float(part)/float(whole)
# better way to open files, handles closing the file
with open(name, 'r') as handle:
    text = handle.read()
    words = text.split()
# get rid of non-word characters that are messing up count
formatted = []
for w in words:
    formatted.extend(re.findall(r'\w+', w))
total = len(formatted)
counts = dict()
for word in formatted:
    counts[word] = counts.get(word,0) + 1
print "\n"
print"Total Words\n", total
print"\n"
# iterate over the counts dict instead of the original word list
# this way each word is only printed once
for word,count in counts.iteritems():
    print word, percent(counts[word],total),"%"
Output when run on this program:
Total Words
79
text 2.53164556962 %
float 2.53164556962 %
as 1.26582278481 %
file 1.26582278481 %
in 3.79746835443 %
handle 2.53164556962 %
counts 6.32911392405 %
total 3.79746835443 %
open 1.26582278481 %
findall 1.26582278481 %
for 3.79746835443 %
0 1.26582278481 %
percent 2.53164556962 %
formatted 5.06329113924 %
1 1.26582278481 %
re 2.53164556962 %
dict 1.26582278481 %
usr 1.26582278481 %
Words 1.26582278481 %
print 5.06329113924 %
import 1.26582278481 %
split 1.26582278481 %
bin 1.26582278481 %
return 1.26582278481 %
extend 1.26582278481 %
get 1.26582278481 %
python 1.26582278481 %
len 1.26582278481 %
iteritems 1.26582278481 %
part 2.53164556962 %
words 2.53164556962 %
Enter 1.26582278481 %
100 1.26582278481 %
with 1.26582278481 %
count 1.26582278481 %
word 7.59493670886 %
name 2.53164556962 %
read 1.26582278481 %
raw_input 1.26582278481 %
n 3.79746835443 %
r 1.26582278481 %
w 3.79746835443 %
Total 1.26582278481 %
whole 2.53164556962 %
def 1.26582278481 %
Breakdown of formatted.extend(re.findall(r'\w+', w)):
1: The extend method of lists takes another list and appends its elements to the list it is called on. For example:
listA = [1,2,3]
listB = [4,5,6]
listA.extend(listB)
print(listA)
# [1, 2, 3, 4, 5, 6]
2: re.findall(r'\w+', w)
This expression uses regular expressions to extract only the part of the string we care about. Here is a tutorial on Python's regular expressions.
Basically, re.findall(x, y) returns a list of all the non-overlapping substrings in y that match the regular expression pattern given in x. In our case, \w matches any word character (letters, digits, and the underscore), and the + means one or more of the preceding pattern. So put together, \w+ means one or more word characters.
I probably made it somewhat confusing by naming the string variable we're searching w, but just keep in mind that the \w in the pattern is not related to the w variable that holds the string.
word = 'heres some1; called s0mething!'
re.findall(r'\w+', word)
# ['heres', 'some1', 'called', 's0mething']
Upvotes: 0
Reputation: 26
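For your first problem, you can use an OrderedDict from the collections module, sorting with reverse=True so the highest counts come first:
import collections
sortedCounts = collections.OrderedDict(sorted(counts.items(), key=lambda t: t[1], reverse=True))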
To only print each word once:
for key, value in sortedCounts.iteritems():
    print key, percent(value,total),"%"
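Since the question mentions collections and Counter, here is a minimal Python 3 sketch of that route. The function name word_percentages and the sample text are hypothetical, and the \w+ tokenizing is borrowed from the other answer; Counter.most_common() returns the pairs ordered from most to least frequent, so no separate sort is needed.

```python
import re
from collections import Counter


def word_percentages(text):
    """Return (word, percent) pairs, most frequent first."""
    words = re.findall(r"\w+", text)
    total = len(words)
    counts = Counter(words)
    # most_common() yields (word, count) pairs in descending count order.
    return [(word, 100.0 * count / total) for word, count in counts.most_common()]


# Made-up sample text for illustration.
for word, pct in word_percentages("a b a c a b"):
    print(word, pct, "%")
```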
Hope that helps
Upvotes: 0