Reputation: 5941
I have a list of size 208 (208 document strings, each containing sentences), which looks like:
all_words = ["this is a sentence ... ", " another one hello bob this is alice ... ", "..."]
I want to get the words with the highest tf-idf values. I created a tf-idf matrix:
from sklearn.feature_extraction.text import TfidfVectorizer

# whitespace tokenizer; ngram_range=(1, 2) indexes both unigrams and bigrams
tokenize = lambda doc: doc.split(" ")
sklearn_tfidf = TfidfVectorizer(norm='l2', tokenizer=tokenize, ngram_range=(1, 2))
tfidf_matrix = sklearn_tfidf.fit_transform(all_words)  # 208 x 5481 sparse matrix
feature_names = sklearn_tfidf.get_feature_names()  # one term (unigram or bigram) per column
dense_tfidf = tfidf_matrix.todense()
Now I don't know how to get the words with the highest tf-idf values.
Each column of dense_tfidf represents a unigram or bigram (the matrix is 208x5481).
When I summed each column, it didn't really help: I got essentially the same result as a plain top-words list (I guess because summing tf-idf over all documents behaves like a simple word count).
How can I get the words with the highest tf-idf values? Or how can I normalize the matrix sensibly?
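For reference, one way around the word-count effect is to rank each term by its maximum tf-idf score across documents (or to pull the top-k terms per document) instead of summing columns. A minimal sketch, assuming all_words is the flat list of 208 strings and feature_names comes from the vectorizer as in the code above:

import numpy as np

# Rank terms by their highest tf-idf score in any single document,
# rather than by the column sum (which behaves like a raw count).
scores = np.asarray(tfidf_matrix.max(axis=0).todense()).ravel()
for i in np.argsort(scores)[::-1][:20]:
    print(feature_names[i], scores[i])

# Alternatively, the top 10 terms for a single document (row 0):
row = np.asarray(tfidf_matrix[0].todense()).ravel()
for i in np.argsort(row)[::-1][:10]:
    print(feature_names[i], row[i])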
Upvotes: 4
Views: 2805
Reputation: 51
I had a similar issue but found this at https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f; just change the X and y inputs based on your dataframe. The code from the blog is below. sklearn's docs also helped me: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
from sklearn.feature_selection import chi2
import numpy as np

N = 2
for Product, category_id in sorted(category_to_id.items()):
    # Score every tf-idf feature against membership in this category.
    features_chi2 = chi2(features, labels == category_id)
    # Sort feature indices from least to most correlated (chi2 scores are in [0]).
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(Product))
    print(" . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
    print(" . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))
Upvotes: 3