Reputation: 21
My program opens a file and counts the words in it, but I want to create a dictionary of all the unique words in the text. For example, if the word 'computer' appears three times, I want that to count as one unique word.
def main():
    filename = input('Enter the name of the input file: ')
    # The with statement closes the file automatically.
    with open(filename, 'r') as infile:
        file_contents = infile.read()
    # split() with no arguments splits on any run of whitespace.
    words = file_contents.split()
    number_of_words = len(words)
    print("There are", number_of_words, "words contained in this paragraph")

main()
Upvotes: 2
Views: 95
Reputation: 10328
Use a set. This will only include unique words:
words = set(words)
If you don't care about case, you can do this:
words = set(word.lower() for word in words)
This assumes there is no punctuation. If there is, you will need to strip the punctuation.
import string
words = set(word.lower().strip(string.punctuation) for word in words)
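To tie the snippets above together, here is a minimal sketch wrapped in a function. The name unique_words and the sample sentence are made up for illustration. Note that strip() only removes punctuation from the start and end of each token, so a word like "don't" keeps its apostrophe, and a token made entirely of punctuation becomes an empty string, which this sketch drops.

```python
import string

def unique_words(text):
    """Return the set of distinct words in text, ignoring case and edge punctuation."""
    cleaned = (word.lower().strip(string.punctuation) for word in text.split())
    # Drop tokens that were pure punctuation and became empty strings.
    return {word for word in cleaned if word}

sample = "The computer beat the other computer. Computer -- wins!"
found = unique_words(sample)
print(sorted(found))  # the distinct words, sorted for readable output
print(len(found))     # the number of unique words
```

len() on the resulting set gives the unique word count the question asks for.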
If you need to keep track of how many of each word you have, just replace set with Counter in the examples above:
import string
from collections import Counter
words = Counter(word.lower().strip(string.punctuation) for word in words)
This will give you a dictionary-like object that tells you how many times each word occurs.
You can also get the number of unique words from this (although it is slower if that is all you care about):
import string
from collections import Counter
words = Counter(word.lower().strip(string.punctuation) for word in words)
nword = len(words)
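As a rough illustration of what the Counter gives you, the sketch below counts a short made-up sentence and then asks for the most frequent words. The function name count_words and the sample text are assumptions for the example; most_common() is a standard Counter method.

```python
import string
from collections import Counter

def count_words(text):
    """Count each word in text, ignoring case and edge punctuation."""
    cleaned = (w.lower().strip(string.punctuation) for w in text.split())
    # Skip empty strings left over from punctuation-only tokens.
    return Counter(w for w in cleaned if w)

counts = count_words("Baz foo baz, bar. BAZ foo!")
print(len(counts))            # number of unique words
print(counts["baz"])          # how many times 'baz' occurs
print(counts.most_common(2))  # the two most frequent words with their counts
```

Looking up a word that never occurred, such as counts["qux"], returns 0 instead of raising KeyError, which is one difference from a plain dict.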
Upvotes: 2
Reputation: 5651
A possibly cleaner and quicker solution, using a plain dictionary:
words_dict = {}
for word in words:
    word_count = words_dict.get(word, 0)
    words_dict[word] = word_count + 1
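For anyone who wants to try this, here is the same get()-based loop as a small self-contained sketch. The function name count_with_dict is only for illustration. The comparison with Counter at the end is there to show that both approaches should produce the same counts.

```python
from collections import Counter

def count_with_dict(words):
    """Count occurrences of each word using a plain dict and dict.get()."""
    words_dict = {}
    for word in words:
        # get() returns 0 the first time a word is seen.
        words_dict[word] = words_dict.get(word, 0) + 1
    return words_dict

words = ["Foo", "Bar", "Baz", "Baz"]
result = count_with_dict(words)
print(result)                          # per-word counts
print(len(result))                     # number of unique words
print(result == dict(Counter(words)))  # compare against Counter
```

The dictionary's keys are the unique words, so len() of the dictionary answers the original question as well.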
Upvotes: 0
Reputation: 366
@TheBlackCat's set-based solution works, but it only tells you how many unique words are in the string or file. This solution also shows how many times each word occurs.
dictionaryName = {}
for word in words:
    # Membership tests work directly on the dict; no need to build a list.
    if word not in dictionaryName:
        dictionaryName[word] = 1
    else:
        dictionaryName[word] += 1
print(dictionaryName)
tested with:
words = "Foo", "Bar", "Baz", "Baz"
output: {'Foo': 1, 'Bar': 1, 'Baz': 2}
Upvotes: 0