user4069366

How to improve classification of small texts

The data I've got consists mostly of tweets and short comments (300-400 chars). I used a Bag-of-Words model with Naive Bayes classification, and now I'm getting a lot of misclassified cases of the type below:

1.] He sucked on a lemon early morning to get rid of hangover.
2.] That movie sucked big time.

The problem is that during sentiment classification both get labelled "Negative", just because of the word "sucked":

Sentiment Classification : 1.] Negative 2.] Negative

Similarly, during document classification both get classified as "Movie" due to the presence of the word "sucked":

Document Classification : 1.] Movie 2.] Movie

This is just one such instance; I'm facing a huge number of wrong classifications and have no idea how to improve the accuracy. A minimal sketch of my setup is below.
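For reference, here is a minimal sketch of the setup described above (scikit-learn assumed; the training texts and labels are toy placeholders), which reproduces the problem:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy training data, purely for illustration.
    train_texts = ["worst movie ever",
                   "that party sucked big time",
                   "loved the fresh lemon drink"]
    train_labels = ["negative", "negative", "positive"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    # Both sentences contain the strongly negative token "sucked", so with
    # this toy training set both come out "negative":
    print(model.predict([
        "He sucked on a lemon early morning to get rid of hangover.",
        "That movie sucked big time.",
    ]))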

Upvotes: 2

Views: 1175

Answers (2)

sy2

Reputation: 51

(1) One straightforward change from Bag-of-Words with Naive Bayes is to generate polynomial combinations of the Bag-of-Words features. It might solve the problems you have shown above, because the combined features can carry a different signal than either word alone:

"sucked" + "lemon" (positive)

"sucked" + "movie" (negative)

Of course, you can also generate polynomial combinations of n-grams, but the number of features might become too large.

The scikit-learn library provides a preprocessing class for this purpose:

sklearn.preprocessing.PolynomialFeatures (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)

Theoretically, an SVM with a polynomial kernel does the same thing as PolynomialFeatures plus a linear SVM; they differ only in how the model information is stored.

In my experience, PolynomialFeatures + linear SVM performs reasonably well for short-text classification, including sentiment analysis.
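Here is a minimal sketch of that pipeline with toy data; it assumes a recent scikit-learn, since older versions of PolynomialFeatures do not accept the sparse matrices that CountVectorizer produces:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import LinearSVC

    clf = Pipeline([
        # Cap the vocabulary: the number of pairwise combinations grows
        # quadratically with the number of base features.
        ("bow", CountVectorizer(max_features=1000)),
        # interaction_only=True keeps only cross terms such as sucked*lemon.
        ("poly", PolynomialFeatures(degree=2, interaction_only=True)),
        ("svm", LinearSVC()),
    ])

    clf.fit(
        ["He sucked on a lemon to cure his hangover",
         "That movie sucked big time"],
        ["positive", "negative"],  # toy labels
    )
    print(clf.predict(["The lemon he sucked on was fresh"]))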

If the dataset is not large enough, the training data might not contain the "sucked" + "lemon" combination at all. In that case, dimensionality reduction techniques such as Singular Value Decomposition (SVD) and topic models such as Latent Dirichlet Allocation (LDA) are suitable tools for building semantic clusters of words.
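For instance, a hedged sketch using scikit-learn's TruncatedSVD (the SVD variant commonly used for text, a.k.a. latent semantic analysis) on BoW counts; the corpus and component count are toy choices:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    corpus = [
        "He sucked on a lemon to cure his hangover",
        "That movie sucked big time",
        "The lemon tart at that cafe was great",
        "A great movie with a weak ending",
    ]
    counts = CountVectorizer().fit_transform(corpus)

    # Project the sparse counts onto a few latent dimensions; words that
    # co-occur (lemon/cafe vs. movie/ending) share dimensions, so unseen
    # combinations like "sucked" + "lemon" still get sensible features.
    svd = TruncatedSVD(n_components=2)
    dense = svd.fit_transform(counts)
    print(dense.shape)  # (4, 2) -- compact features for a downstream classifier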

(2) Another direction is to use more sophisticated natural language processing (NLP) techniques to extract additional information from short texts. For example, Part-of-Speech (POS) tagging and Named Entity Recognition (NER) will give you more information than plain BoW features. The Python NLP library Natural Language Toolkit (NLTK) implements these functions.
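A small NLTK sketch for POS tagging (assumes nltk is installed; the download names vary slightly across NLTK versions):

    import nltk

    nltk.download("punkt")                       # tokenizer model
    nltk.download("averaged_perceptron_tagger")  # POS tagger model

    tokens = nltk.word_tokenize("That movie sucked big time.")
    print(nltk.pos_tag(tokens))
    # e.g. [('That', 'DT'), ('movie', 'NN'), ('sucked', 'VBD'), ...]
    # The resulting tags (and NER chunks from nltk.ne_chunk, which needs
    # extra downloads) can be appended to the BoW features.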

(3) You can also take the slow but steady way: analyzing the current model's prediction errors in order to design new hand-crafted features is a promising way to improve its accuracy.

There is a library for short-text classification called LibShortText, which also contains an error-analysis function and preprocessing functions such as TF-IDF weighting. It might help you learn how to improve the model via error analysis:

LibShortText (https://www.csie.ntu.edu.tw/~cjlin/libshorttext/)

(4) For further information, take a look at the literature on sentiment analysis of tweets; it will give you more advanced techniques.

Upvotes: 3

hoaphumanoid

Reputation: 997

Maybe you could try a more powerful classifier like Support Vector Machines. Also, depending on the amount of data you have, you could try deep learning with convolutional neural nets, but for that you will need a huge number of training examples (100k-1M).
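A minimal sketch of the SVM suggestion, assuming scikit-learn (texts and labels are toy placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    svm = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),  # word unigrams + bigrams
        LinearSVC(),
    )
    svm.fit(
        ["that movie sucked", "loved this film",
         "worst plot ever", "great acting"],
        ["negative", "positive", "negative", "positive"],
    )
    print(svm.predict(["the acting was great"]))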

Upvotes: 0
