How to get the most representative features in the following tfidf model?

Question

Hello, I have the following list:

listComments = ["comment1","comment2","comment3",...,"commentN"]

I created a tfidf vectorizer to get a model from my comments as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=10, ngram_range=(1, 3), analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(listComments)

Now, in order to understand more about my model, I would like to get the most representative features. I tried:

print("these are the features :",tfidf_vectorizer.get_feature_names())
print("the vocabulary :",tfidf_vectorizer.vocabulary_)

This gives me a list of words that I think my model is using for the vectorization:

these are the features : ['10', '10 days', 'red', 'car',...]

the vocabulary : {'edge': 86, 'local': 96, 'machine': 2,...}

However, I would like to find a way to get the 30 most representative features, that is, the words that achieve the highest values in my tfidf model (the words with the highest inverse document frequency). I have been reading the documentation but was not able to find a method for this. I would really appreciate help with this issue. Thanks in advance.

Ted Petrou · Accepted Answer

If you want the vocabulary sorted by idf score, you can use the vectorizer's idf_ attribute and argsort it. Keep in mind that a high idf means the term appears in few documents.

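import numpy as np

# create an array of feature names
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())

# get the feature indices ordered from highest to lowest idf
idf_order = tfidf_vectorizer.idf_.argsort()[::-1]

# feature names sorted by descending idf; the slice keeps the top 30
feature_names[idf_order][:30]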

If you would like the features sorted by tfidf score within each document, you can do a similar thing, sorting along each row of the matrix.

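# get order for all documents based on tfidf scores
# argsort sorts each row (document) in ascending order; [:, ::-1] reverses
# the columns so every row runs from highest to lowest score
# (toarray() builds a dense matrix, which can be large for big corpora)
tfidf_order = tfidf.toarray().argsort()[:, ::-1]

# produce words: row i holds the features of document i in descending tfidf order
feature_names[tfidf_order]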
