Reputation: 367
I am trying to build a simple program that takes a text file and builds a dict()
with the words as keys and the number of times each word appears (the word frequency) as values.
I've learned that the collections.Counter
class can do this easily (among other methods). My problem is that I'd like the dictionary to be ordered by frequency so that I can print the N most frequent words. Finally, I also need a way to later associate a value of a different type with each word (a string holding the word's definition).
Basically I need something that outputs this:
Number of words: 5
[mostfrequentword: frequency, definition]
[2ndmostfrequentword: frequency, definition]
etc.
This is what I have so far, but it only counts the word frequency. I don't know how to order the dictionary by frequency and then print the N most frequent words:
wordlist = {}

def cleanedup(string):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    cleantext = ''
    for character in string.lower():
        if character in alphabet:
            cleantext += character
        else:
            cleantext += ' '
    return cleantext

def text_crunch(textfile):
    for line in textfile:
        for word in cleanedup(line).split():
            if word in wordlist:
                wordlist[word] += 1
            else:
                wordlist[word] = 1

with open('DQ.txt') as doc:
    text_crunch(doc)

print(wordlist['todos'])
Upvotes: 0
Views: 522
Reputation: 80011
A simpler version of your code that does pretty much what you want :)
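import string
import collections

def cleanedup(fh):
    # Yield each lowercase run of ASCII letters in the file as one word.
    for line in fh:
        word = ''
        for character in line.lower():
            if character in string.ascii_lowercase:
                word += character
            elif word:
                yield word
                word = ''
        if word:
            # The last line may not end with a newline.
            yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))

# The 5 most frequent words as (word, count) pairs.
print(wordlist.most_common(5))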
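For reference, most_common(n) returns a list of (word, count) tuples sorted from the highest count down, which is the ordering asked for in the question. A minimal sketch on an in-memory string (the sample text is made up for illustration and is not from DQ.txt):

```python
import collections

# Made-up sample text standing in for the contents of DQ.txt.
sample = "de la de que la de"
counts = collections.Counter(sample.split())

# most_common(n) returns the n highest counts as (word, count) tuples.
top_two = counts.most_common(2)
print(top_two)  # [('de', 3), ('la', 2)]

# A Counter is a dict subclass, so single lookups work as before.
print(counts['que'])  # 1
```

Calling most_common() with no argument returns every word in the same descending order.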
Alternative solutions with regular expressions:
import re
import collections

def cleanedup(fh):
    for line in fh:
        # findall returns every run of letters in the lowercased line.
        for word in re.findall('[a-z]+', line.lower()):
            yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))

print(wordlist.most_common(5))
Or:
import re
import collections

def cleanedup(fh):
    for line in fh:
        for word in re.split('[^a-z]+', line.lower()):
            # split leaves empty strings at the line edges; skip them.
            if word:
                yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))

print(wordlist.most_common(5))
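None of the versions above cover the definitions part of the question. One possible approach is to keep the definitions in a separate dict keyed by word and look them up while printing. The sketch below assumes a hypothetical definitions dict and a hypothetical format_report helper, and it reads "Number of words" in the desired output as the number of entries printed, which is a guess:

```python
import collections

def format_report(counts, definitions, n):
    """Return report lines for the n most frequent words in counts."""
    top = counts.most_common(n)
    lines = ["Number of words: %d" % len(top)]
    for word, frequency in top:
        # Words without a known definition get a placeholder.
        definition = definitions.get(word, "(no definition)")
        lines.append("[%s: %d, %s]" % (word, frequency, definition))
    return lines

# Made-up sample data, not taken from DQ.txt.
counts = collections.Counter("de la de que la de".split())
definitions = {"de": "of, from", "la": "the (feminine)"}

for line in format_report(counts, definitions, 2):
    print(line)
```

Keeping the counts and the definitions in two dicts means the Counter can still be used as is, and definitions can be filled in later without touching the counts.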
Upvotes: 1