Reputation: 5286
I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequencies in the text corpus, in order to select stop-words. For example:

'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on

Is there any built-in function for this?
Upvotes: 14
Views: 12583
Reputation: 4190
There is no built-in function for this. I have found a faster way to do it, based on Ando Saabas's answer:
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Hello world", "Python makes a better world"]
vec = CountVectorizer().fit(texts)
bag_of_words = vec.transform(texts)   # sparse document-term matrix
sum_words = bag_of_words.sum(axis=0)  # total count of each term over the whole corpus
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
sorted(words_freq, key=lambda x: x[1], reverse=True)
Output:
[('world', 2), ('python', 1), ('hello', 1), ('better', 1), ('makes', 1)]
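Since the goal is stop-word selection, the resulting frequencies can be fed straight back into a new CountVectorizer via its stop_words parameter. A minimal sketch (the sample texts and the top-1 cutoff are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Hello world", "Python makes a better world"]
vec = CountVectorizer().fit(texts)
bag_of_words = vec.transform(texts)   # sparse document-term matrix
sum_words = bag_of_words.sum(axis=0)  # total count of each term over the whole corpus
words_freq = sorted(
    ((word, int(sum_words[0, idx])) for word, idx in vec.vocabulary_.items()),
    key=lambda x: x[1], reverse=True)

# treat the single most frequent term as a stop word (cutoff is illustrative)
stop = [w for w, c in words_freq[:1]]
vec2 = CountVectorizer(stop_words=stop).fit(texts)
print(sorted(vec2.vocabulary_))  # ['better', 'hello', 'makes', 'python']
```

The refitted vectorizer no longer assigns a column to 'world', the most frequent term.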
Upvotes: 5
Reputation: 363567
If cv is your CountVectorizer and X is the vectorized corpus, then

zip(cv.get_feature_names(), np.asarray(X.sum(axis=0)).ravel())

produces a (term, frequency) pair for each distinct term that the CountVectorizer extracted from the corpus. (In Python 3, zip returns an iterator, so wrap it in list() if you need an actual list. Note also that get_feature_names was renamed get_feature_names_out in scikit-learn 1.0 and removed in 1.2.)

The little asarray + ravel dance is needed to work around some quirks in scipy.sparse: summing a sparse matrix yields a np.matrix, which ravel alone does not flatten.
Upvotes: 23