Dider

Reputation: 367

Python program: foreign language word-frequency dictionary

I am trying to build a simple program which takes a text file, builds a dict() with the words as keys, and the values as the number of times each word appears (word frequency).

I've learned that the collections.Counter class can do this easily (among other methods). My problem is that I'd like the dictionary to be ordered by frequency so that I can print the N most frequent words. Finally, I also need a way for the dictionary to later associate each word with a value of a different type (a string holding the word's definition).

Basically I need something that outputs this:

Number of words: 5
[mostfrequentword: frequency, definition]
[2ndmostfrequentword: frequency, definition]
etc.   

This is what I have so far, but it only counts the word frequency. I don't know how to order the dictionary by frequency and then print the N most frequent words:

wordlist = {}

def cleanedup(string):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    cleantext = ''
    for character in string.lower():
        if character in alphabet:
            cleantext += character
        else:
            cleantext += ' '
    return cleantext

def text_crunch(textfile):
    for line in textfile:
        for word in cleanedup(line).split():
            if word in wordlist:
                wordlist[word] += 1
            else:
                wordlist[word] = 1


with open('DQ.txt') as doc:
    text_crunch(doc)
    print(wordlist['todos'])

Upvotes: 0

Views: 522

Answers (1)

Wolph

Reputation: 80011

A simpler version of your code that does pretty much what you want :)

import string
import collections

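def cleanedup(fh):
    # Yield each run of ASCII letters in the file as a lowercase word.
    for line in fh:
        word = ''
        for character in line.lower():
            if character in string.ascii_letters:
                word += character
            elif word:
                yield word
                word = ''
        if word:
            # Emit a word that runs to the end of a line with no trailing newline.
            yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))
    print(wordlist.most_common(5))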
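
Counter's most_common(n) returns a list of (word, count) tuples with the highest count first. If you would rather keep the plain dict from the question, the same ordering can probably be had with sorted(). A minimal sketch, where the sample counts and the helper name top_n are made up for illustration:

```python
# Hypothetical counts standing in for the question's `wordlist` dict.
wordlist = {'todos': 3, 'que': 7, 'casa': 1}

def top_n(counts, n):
    # Sort (word, count) pairs by count, highest first, and keep the first n.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]

print(top_n(wordlist, 2))
```

The dict itself stays unordered by frequency; you sort only when you need the ranking.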

Alternative solutions with regular expressions:

import re
import collections

def cleanedup(fh):
    # Yield every run of lowercase ASCII letters found on each line.
    for line in fh:
        for word in re.findall(r'[a-z]+', line.lower()):
            yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))
    print(wordlist.most_common(5))

Or:

import re
import collections

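def cleanedup(fh):
    # Split each line on anything that is not a lowercase ASCII letter.
    for line in fh:
        for word in re.split(r'[^a-z]+', line.lower()):
            if word:  # re.split leaves empty strings at the line edges
                yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))
    print(wordlist.most_common(5))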
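
For the definitions part of the question, one option is to leave the Counter alone and keep the definitions in a second dict keyed by the same words. This is only a sketch: the sample words, the definitions dict, and the report helper are hypothetical, and it assumes "Number of words" means the number of distinct words.

```python
import collections

# Hypothetical sample data; in practice the counter comes from the file
# and the definitions from wherever you store them.
wordlist = collections.Counter(['todos', 'que', 'que', 'casa', 'que', 'todos'])
definitions = {'que': 'that, which', 'todos': 'all, everyone'}

def report(counts, definitions, n):
    # First line: number of distinct words (an assumption about the format).
    lines = ['Number of words: %d' % len(counts)]
    for word, frequency in counts.most_common(n):
        # Fall back to a placeholder when a word has no definition yet.
        definition = definitions.get(word, '(no definition)')
        lines.append('[%s: %d, %s]' % (word, frequency, definition))
    return lines

print('\n'.join(report(wordlist, definitions, 2)))
```

Keeping the two dicts separate means the counts can be rebuilt from a new text without touching the definitions.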

Upvotes: 1
