machine-learning scikit-learn theory supervised-learning

Reputation: 8241

Naive Bayes vs. SVM for classifying text data

I'm working on a problem that involves classifying a large database of texts. The texts are very short (think 3-8 words each) and there are 10-12 categories into which I wish to sort them. For the features, I'm simply using the tf–idf frequency of each word. Thus, the number of features is roughly equal to the number of words that appear overall in the texts (I'm removing stop words and some others).

In trying to come up with a model to use, I've had the following two ideas:

Naive Bayes (likely the sklearn multinomial Naive Bayes implementation)
Support vector machine (with stochastic gradient descent used in training, also an sklearn implementation)

I have built both models, and am currently comparing the results.
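For reference, here is a minimal sketch of the two setups described above. The toy texts, the two category names, and the `models` and `predictions` variables are made up for illustration; the real data has 10-12 categories. Note that sklearn's `SGDClassifier` with hinge loss trains a linear SVM.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real short texts and categories.
texts = [
    "cheap flights paris weekend",
    "hotel deals rome summer",
    "train tickets berlin discount",
    "python list sort example",
    "java string format error",
    "rust borrow checker lifetime",
]
labels = ["travel", "travel", "travel", "code", "code", "code"]

# Both pipelines use the same tf-idf features with English stop words removed.
models = {
    "naive_bayes": make_pipeline(
        TfidfVectorizer(stop_words="english"),
        MultinomialNB(),
    ),
    "linear_svm_sgd": make_pipeline(
        TfidfVectorizer(stop_words="english"),
        SGDClassifier(loss="hinge", random_state=0),  # hinge loss = linear SVM
    ),
}

new_texts = ["cheap hotel paris", "python string error"]
predictions = {}
for name, model in models.items():
    model.fit(texts, labels)
    predictions[name] = list(model.predict(new_texts))
```

On real data the comparison would of course be done with a held-out set or cross-validation rather than two hand-picked examples.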

What are the theoretical pros and cons to each model? Why might one of these be better for this type of problem? I'm new to machine learning, so what I'd like to understand is why one might do better.

Many thanks!

Upvotes: 19

Answers (2)

Prakhar Agarwal

Reputation: 2852

Support Vector Machine (SVM) is better at full-length content.
Multinomial Naive Bayes (MNB) is better at snippets.

MNB is stronger for snippets than for longer documents. While Ng and Jordan (2002) showed that NB is better than SVM/logistic regression (LR) with few training cases, MNB is also better with short documents. In contrast to their result that an SVM usually beats NB once it has more than 30–50 training cases, Wang and Manning (2012) show that MNB is still better on snippets even with relatively large training sets (9k cases).

In short, NBSVM (an SVM trained on Naive Bayes log-count ratio features) seems to be an appropriate and very strong baseline for sophisticated text classification.
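To give a rough idea of what NBSVM does, here is a minimal sketch of the Naive Bayes log-count ratio from the paper cited below, for the binary case with binarized word-presence features. The function name `log_count_ratio` and the toy documents are made up for illustration, and this is not the code from the linked repository. In NBSVM each feature vector is scaled elementwise by this ratio before a linear SVM is trained on it; for 10-12 categories one would presumably do this one-vs-rest.

```python
import math


def log_count_ratio(docs, labels, alpha=1.0):
    """Return (vocab, r) with r[j] = log((p_j / ||p||_1) / (q_j / ||q||_1)).

    p and q are smoothed counts of documents containing word j in the
    positive (label 1) and negative (label 0) class respectively.
    """
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    p = [alpha] * len(vocab)  # positive-class counts, smoothed by alpha
    q = [alpha] * len(vocab)  # negative-class counts, smoothed by alpha
    for doc, y in zip(docs, labels):
        target = p if y == 1 else q
        for w in set(doc.split()):  # binarized: count each word once per doc
            target[index[w]] += 1.0
    p_sum, q_sum = sum(p), sum(q)
    r = [math.log((pj / p_sum) / (qj / q_sum)) for pj, qj in zip(p, q)]
    return vocab, r


# Hypothetical toy snippets: 1 = positive class, 0 = negative class.
docs = ["great fun film", "great acting", "dull plot", "dull boring film"]
labels = [1, 1, 0, 0]
vocab, r = log_count_ratio(docs, labels)
weights = dict(zip(vocab, r))
```

Words seen mostly in the positive class get a positive weight, words seen mostly in the negative class get a negative one, and words that are equally common in both end up near zero.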

Source Code: https://github.com/prakhar-agarwal/Naive-Bayes-SVM

Reference: http://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf

Cite: Wang, Sida, and Christopher D. Manning. "Baselines and bigrams: Simple, good sentiment and topic classification." Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2. Association for Computational Linguistics, 2012.

Upvotes: 8

Horia Coman

Reputation: 8781

The biggest difference between the models you're building, from a "features" point of view, is that Naive Bayes treats the features as independent, whereas an SVM looks at the interactions between them to a certain degree, as long as you're using a non-linear kernel (Gaussian/RBF, polynomial, etc.). So if you have interactions, and given your problem you most likely do, an SVM will be better at capturing them, and hence better at the classification task you want.

The consensus among ML researchers and practitioners is that in almost all cases an SVM does better than Naive Bayes.

From a theoretical point of view, it is a little bit hard to compare the two methods. One is probabilistic in nature, while the other is geometric. However, it's quite easy to come up with a function whose dependencies between variables are not captured by Naive Bayes (for example y(a,b) = ab), so we know it isn't a universal approximator. SVMs with the proper choice of kernel are, though (as are 2/3-layer neural networks), so from that point of view the theory matches the practice.
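The y(a,b) = ab example can be checked by hand. The sketch below assumes a and b each take values in {-1, +1} (that choice, and the helper names `conditional` and `nb_posterior_positive`, are mine, purely for illustration). Each feature on its own carries no information about the label, so Naive Bayes cannot do better than a coin flip, while the product a*b, the kind of cross term a degree-2 polynomial kernel supplies implicitly, determines the label exactly.

```python
from itertools import product

# All four input points with a, b in {-1, +1}; the label is the product a*b.
points = list(product([-1, 1], repeat=2))
labels = [a * b for a, b in points]


def conditional(feature_idx, value, label):
    """Estimate P(x_i = value | y = label) from the four points."""
    in_class = [p for p, y in zip(points, labels) if y == label]
    matches = sum(1 for p in in_class if p[feature_idx] == value)
    return matches / len(in_class)


def nb_posterior_positive(point):
    """Naive Bayes estimate of P(y = +1 | point), features assumed independent."""
    scores = {}
    for label in (-1, 1):
        prior = labels.count(label) / len(labels)
        likelihood = 1.0
        for i, value in enumerate(point):
            likelihood *= conditional(i, value, label)
        scores[label] = prior * likelihood
    return scores[1] / (scores[1] + scores[-1])


# Naive Bayes assigns the same posterior to every point.
posteriors = [nb_posterior_positive(p) for p in points]

# With the interaction a*b available as a feature, its sign is the label.
interaction_separates = all(
    (a * b > 0) == (y > 0) for (a, b), y in zip(points, labels)
)
```

Every entry of `posteriors` comes out as 0.5, because each class contains a = +1 and a = -1 (and likewise b) equally often.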

But in the end it comes down to performance on your problem - you basically want to choose the simplest method that gives good enough results for your problem and has good enough performance. Spam detection has famously been solvable with just Naive Bayes, for example, and face recognition in images with a similar method enhanced with boosting, etc.

Upvotes: 31
