Reputation: 2432
I have a small Python script that prints the 10 most frequent words in a text document (counting only words of 2 letters or more), and I need to extend it to print the 10 most INfrequent words as well. The script mostly works, except that the 10 least frequent "words" it prints are numbers (integers and floats) when they should be actual words. How can I iterate over ONLY the words and exclude the numbers? Here is my full script:
# Most Frequent Words:
from string import punctuation
from collections import defaultdict
number = 10
words = {}
with open("charactermask.txt") as txt_file:
    words = [x.strip(punctuation).lower() for x in txt_file.read().split()]
counter = defaultdict(int)
for word in words:
    if len(word) >= 2:
        counter[word] += 1
top_words = sorted(counter.iteritems(),
                   key=lambda(word, count): (-count, word))[:number]
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]
for word, frequency in least_words:
    print "%s: %d" % (word, frequency)
EDIT: The end of the script (the part under the # Least Frequent Words comment) is the part that needs fixing.
Upvotes: 0
Views: 635
Reputation: 9407
You need a function, letters_only(), which runs a regular expression matching [0-9] and returns False if any match is found. Something like this:
import re
def letters_only(word):
    return re.search(r'[0-9]', word) is None  # True when the word has no digits
Then, where you say for word in words, instead say for word in filter(letters_only, words).
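As a quick check of how that filter behaves, here is a minimal sketch written for Python 3 (the question's script is Python 2). The sample_words list is a made-up stand-in for the tokens read from the file; only letters_only() comes from the answer above.

```python
import re

def letters_only(word):
    # True when the word contains no digit characters at all.
    return re.search(r'[0-9]', word) is None

# Hypothetical sample tokens standing in for the words read from the file.
sample_words = ["mask", "3.14", "42", "abc1", "character"]

# filter() keeps only the items for which letters_only() returns True.
kept = list(filter(letters_only, sample_words))
print(kept)  # ['mask', 'character']
```

Note that this rejects any token containing a digit, including mixed ones such as "abc1", and it still lets empty strings through, so the len(word) >= 2 check in the question's loop remains necessary.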
Upvotes: 1
Reputation: 140669
You're going to need a filter -- change the regex to match however you want to define a "word":
import re
alphaonly = re.compile(r"^[a-z]{2,}$")
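To see what that pattern accepts, here is a small sketch written for Python 3. The sample tokens are invented for illustration, and the looser second pattern (loose_word) is only an assumption about one way you might redefine a "word"; it is not part of the original answer.

```python
import re

# Pattern from the answer: two or more lowercase ASCII letters, nothing else.
alphaonly = re.compile(r"^[a-z]{2,}$")

# Hypothetical alternative: also allow apostrophes and hyphens inside a word.
loose_word = re.compile(r"^[a-z][a-z'-]*[a-z]$")

# Made-up sample tokens, already lowercased and stripped of outer punctuation.
samples = ["mask", "a", "42", "3.14", "abc1", "don't", "well-known"]

strict = [s for s in samples if alphaonly.match(s)]
loose = [s for s in samples if loose_word.match(s)]

print(strict)  # ['mask']
print(loose)   # ['mask', "don't", 'well-known']
```

The strict pattern drops contractions and hyphenated words along with the numbers, which may or may not be what you want for your document.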
Now, do you want the word frequency table to not include numbers in the first place?
counter = defaultdict(int)
with open("charactermask.txt") as txt_file:
    for line in txt_file:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if alphaonly.match(word):
                counter[word] += 1
Or do you just want to skip over the numbers when extracting the least-frequent words from the table?
import sys
words_by_freq = sorted(counter.iteritems(),
                       key=lambda(word, count): (count, word))
i = 0
for word, frequency in words_by_freq:
    if alphaonly.match(word):
        i += 1
        sys.stdout.write("{}: {}\n".format(word, frequency))
        if i == number:
            break
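For anyone on Python 3, where iteritems() and tuple-unpacking lambdas no longer exist, the first approach (keeping numbers out of the table entirely) could look roughly like the sketch below. The function names word_counts, least_frequent and most_frequent and the sample string are hypothetical, and collections.Counter is swapped in for defaultdict(int) for brevity.

```python
import re
from collections import Counter
from string import punctuation

ALPHA_ONLY = re.compile(r"^[a-z]{2,}$")

def word_counts(text):
    """Count lowercase alphabetic words of two or more letters."""
    tokens = (t.strip(punctuation).lower() for t in text.split())
    return Counter(t for t in tokens if ALPHA_ONLY.match(t))

def least_frequent(counter, n):
    """Return the n rarest (word, count) pairs, ties broken alphabetically."""
    return sorted(counter.items(), key=lambda item: (item[1], item[0]))[:n]

def most_frequent(counter, n):
    """Return the n most common (word, count) pairs, ties broken alphabetically."""
    return sorted(counter.items(), key=lambda item: (-item[1], item[0]))[:n]

# Made-up sample text; in the real script this would be txt_file.read().
sample = "The mask hides 42 characters; the mask, 3.14 times, hides the mask."
counts = word_counts(sample)

for word, frequency in least_frequent(counts, 3):
    print("%s: %d" % (word, frequency))
```

Because the numbers never enter the table, the same counts object can feed both the most-frequent and least-frequent listings without any extra filtering at print time.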
Upvotes: 1