Reputation: 843
I am working on a classification problem with Twitter data. User-labeled tweets (relevant, not relevant) are used to train a machine learning classifier to predict whether an unseen tweet is relevant to the user.
I use simple preprocessing techniques such as stopword removal and stemming, and sklearn's TfidfVectorizer to convert the words into numbers before feeding them into a classifier, e.g. SVM, kernel SVM, Naïve Bayes.
I would like to determine which words (features) have the highest predictive power. What is the best way to do so?
I have tried a word cloud, but it just shows the words with the highest frequency in the sample.
UPDATE:
The following approach, along with sklearn's feature_selection, seems to provide the best answer to my problem so far:
top features
Any other suggestions?
Upvotes: 1
Views: 252
Reputation: 36
Have you tried using TF-IDF? It creates a weighted matrix that gives greater weight to the words that are most distinctive for each text. It compares the individual text (in this case a tweet) to all of the texts (all of the tweets), so words that appear in almost every tweet are down-weighted. It is generally much more helpful than raw term counts for classification and other tasks. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
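To make the weighting concrete, here is a small pure-Python sketch of the idea. It assumes one common smoothed variant of the inverse document frequency, ln((1 + n) / (1 + df)) + 1, and skips the length normalization that a full implementation would normally apply; the three tokenized tweets are invented for illustration. Keep in mind that this weighting never looks at the labels, so a high TF-IDF weight on its own does not say a word is predictive.

```python
import math
from collections import Counter

# Hypothetical toy tweets, already tokenized.
docs = [
    "the match was great".split(),
    "the weather was bad".split(),
    "the match was cancelled".split(),
]

n_docs = len(docs)

# Document frequency: in how many tweets does each word appear?
df = Counter(word for doc in docs for word in set(doc))


def tfidf(doc):
    """Return {word: tf * idf} for one tokenized tweet (no normalization)."""
    tf = Counter(doc)
    return {
        word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word, count in tf.items()
    }


weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {weight:.3f}")
```

Words that occur in every tweet ("the", "was") end up with the lowest weight, while a word unique to one tweet ("great") gets the highest.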
Upvotes: 2