Ty Bailey

Reputation: 2432

Print the 10 most infrequent words of a text document using Python

I have a small Python script that prints the 10 most frequent words in a text document (counting only words of 2 or more characters), and I need to extend it to print the 10 most INfrequent words as well. The script mostly works, except that the 10 least frequent "words" it prints are numbers (integers and floats) when they should be words. How can I iterate over ONLY the words and exclude the numbers? Here is my full script:

# Most Frequent Words:
from string import punctuation
from collections import defaultdict

number = 10
words = {}

with open("charactermask.txt") as txt_file:
    words = [x.strip(punctuation).lower() for x in txt_file.read().split()]

counter = defaultdict(int)

for word in words:
  if len(word) >= 2:
    counter[word] += 1

top_words = sorted(counter.iteritems(),
                    key=lambda(word, count): (-count, word))[:number] 

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)


# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (count, word))[:number]

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)

EDIT: The end of the script (the part under the # Least Frequent Words comment) is the part that needs fixing.

Upvotes: 0

Views: 635

Answers (2)

asthasr

Reputation: 9407

You need a function, letters_only(), that uses a regular expression to search the word for a digit ([0-9]) and returns False if one is found. Something like this:

import re

def letters_only(word):
    # True only when the word contains no digits
    return re.search(r'[0-9]', word) is None

Then, where you say for word in words, instead say for word in filter(letters_only, words).
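As a quick check of how that filter behaves, here is a minimal sketch. The sample_words list is made up for illustration and stands in for the words read from the file; list() is wrapped around filter() only so the result prints the same way on Python 2 and Python 3.

```python
import re

def letters_only(word):
    # True only when the word contains no digit characters
    return re.search(r'[0-9]', word) is None

# Hypothetical sample input, standing in for the words read from the file
sample_words = ["mask", "3.14", "the", "42", "abc123", "character"]

# Keep only the entries for which letters_only() returns True
kept = list(filter(letters_only, sample_words))
print(kept)
```

Note that this drops any token containing a digit, so something like "abc123" is excluded along with pure numbers.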

Upvotes: 1

zwol

Reputation: 140669

You're going to need a filter -- change the regex to match however you want to define a "word":

import re
alphaonly = re.compile(r"^[a-z]{2,}$")

Now, do you want the word frequency table to not include numbers in the first place?

counter = defaultdict(int)

with open("charactermask.txt") as txt_file:
    for line in txt_file:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if alphaonly.match(word):
                counter[word] += 1

Or do you just want to skip over the numbers when extracting the least-frequent words from the table?

import sys

words_by_freq = sorted(counter.iteritems(),
                       key=lambda(word, count): (count, word))

i = 0
for word, frequency in words_by_freq:
    # Skip numeric entries; count only the real words printed so far
    if alphaonly.match(word):
        i += 1
        sys.stdout.write("{}: {}\n".format(word, frequency))
    if i == number:
        break
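For reference, the first option (keeping numbers out of the table entirely) can be sketched end to end as below. This is only an illustration under some assumptions: least_frequent() and sample_text are hypothetical names that are not part of the original script, the text is passed in as a string instead of being read from charactermask.txt, and items() with an index-based key is used in place of iteritems() and the tuple-unpacking lambda so the same code runs on Python 2 and Python 3.

```python
import re
from collections import defaultdict
from string import punctuation

# A "word" here is two or more lowercase ASCII letters and nothing else
alphaonly = re.compile(r"^[a-z]{2,}$")

def least_frequent(text, number=10):
    # Hypothetical helper: count only tokens that match alphaonly
    counter = defaultdict(int)
    for raw in text.split():
        word = raw.strip(punctuation).lower()
        if alphaonly.match(word):
            counter[word] += 1
    # Ascending by count, ties broken alphabetically
    ranked = sorted(counter.items(), key=lambda item: (item[1], item[0]))
    return ranked[:number]

# Made-up sample text standing in for the contents of the file
sample_text = "The cat saw the dog. The dog saw 42 cats, 3.14 times!"

for word, frequency in least_frequent(sample_text, 3):
    print("%s: %d" % (word, frequency))
```

With this regex, tokens containing digits, apostrophes, or hyphens (for example "don't") are not counted at all, so adjust the pattern if those should count as words.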

Upvotes: 1
