Reputation: 121
I'm new to python and programming and need your help.
I'm trying to count the most common words in a text using nltk.word_tokenize and Counter. When I get the list of all elements of the text and want to count them, Counter counts only letters.
This is the code:
from collections import Counter
from nltk.tokenize import word_tokenize

word_counter = Counter()
test3 = "hello, hello, how are you? It's me - Boris"
words = word_tokenize(test3)
print(words)
# ['hello', ',', 'hello', ',', 'how', 'are', 'you', '?', 'It', "'s", 'me', '-', 'Boris']
for word in words:
    word_counter.update(word)
print(word_counter)
The output:
Counter({'o': 5, 'e': 4, 'l': 4, 'h': 3, ',': 2, 'r': 2, 's': 2, 'w': 1, 'a': 1, 'y': 1, 'u': 1, '?': 1, 'I': 1, 't': 1, "'": 1, 'm': 1, '-': 1, 'B': 1, 'i': 1})
How could I solve that? I looked through some topics; they solve it with text.split(), but that is not as precise as nltk.
Thank you!
Upvotes: 1
Views: 218
Reputation: 73460
Just use Counter as follows:
word_counter = Counter(words)
Counter.update takes an iterable and updates the counts for the elements the iterable produces. In your loop, that would be the letters of each word (remember that strings are iterables).
If you were to use update, you could do:
from collections import Counter
from nltk.tokenize import word_tokenize

word_counter = Counter()
# ...
words = word_tokenize(test3)
word_counter.update(words)
But there is no need to separate the initialization of the counter and the actual counting unless you want to repeat the second step for multiple lists of words.
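To make the difference concrete, here is a minimal sketch of both approaches side by side. It hardcodes the token list that word_tokenize produced for the example sentence, so it runs without nltk installed:

```python
from collections import Counter

# Tokens as word_tokenize produced them for the example sentence
words = ['hello', ',', 'hello', ',', 'how', 'are', 'you', '?',
         'It', "'s", 'me', '-', 'Boris']

# Correct: pass the whole list, so whole tokens are counted
word_counter = Counter(words)
print(word_counter.most_common(2))   # [('hello', 2), (',', 2)]

# Buggy variant from the question: update() iterates over each
# string, so it counts individual characters instead of words
char_counter = Counter()
for word in words:
    char_counter.update(word)
print(char_counter['o'])             # 5 -- letters, not words
```

most_common(n) is also the direct answer to "count the most common words": it returns the n highest-count (token, count) pairs.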
Upvotes: 1