How to determine which words have high predictive power in Sentiment Analysis?

Question

I am working on a classification problem with Twitter data. User-labeled tweets (relevant, not relevant) are used to train a machine learning classifier to predict whether an unseen tweet is relevant to the user.

I use simple preprocessing techniques such as stopword removal and stemming, and sklearn's TfidfVectorizer to convert the words into numbers before feeding them into a classifier, e.g. SVM, kernel SVM, Naïve Bayes.

I would like to determine which words (features) have the highest predictive power. What is the best way to do so?

I have tried a word cloud, but it only shows the most frequent words in the sample.

UPDATE:

The following approach, along with sklearn's feature_selection, seems to provide the best answer so far to my problem:

top features

Any other suggestions?

alove · Accepted Answer

Have you tried using TF-IDF? It creates a weighted matrix that gives greater weight to the terms that are distinctive to each text, i.e. frequent in that text but rare across the collection. It does this by comparing the individual text (in this case a tweet) to all of the texts (all of the tweets). It is much more helpful than using raw term counts for classification and other tasks. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
