Reputation: 125
I'm a complete newbie to Machine Learning, NLP and Data Analysis, but I'm very motivated to understand them better. I'm reading a couple of books on NLTK, scikit-learn etc. I discovered a Python module called "TextBlob" and found it super easy to get started with. Hence I have created a sample demo Python script which is hosted at: https://gist.github.com/dpnishant/367cef57a8033138eb0a. I'm trying to figure out the best-suited algorithm for sentiment analysis and text classification. My questions are as follows:
Why is the sentiment analysis with the NaiveBayesClassifier slow even on such a small training set? Is this time constant, or will it increase even more with more training data? Also, the sentiment analysis is incorrect (refer to the script output: it says "negative" for the input text "sandwich is good"). What am I doing wrong?
I read in TextBlob's documentation that the NaiveBayesClassifier is trained on the movie_reviews corpus. Is there an API where I can change it to something else, maybe nps_chat? Something that is not very clear to me is the role of a corpus: if we are training the classifier with our own sample training data, how would a more specific corpus, e.g. nps_chat, product_reviews or movie_reviews, help?
I understand that I need to train a classifier for it to work on unlabelled data. But if the training data gets huge, what is the best way to handle it? Should the program build the model from the training data every time, or is there a way to save the model to a file (something like pickle) and read it from there? Is this possible with TextBlob, and would there be any performance improvement with this approach?
In my script, in the last block, I'm trying to evaluate the SklearnClassifier via the NLTKClassifier module but I'm having no luck there: it throws some cryptic error messages. Can you please help me resolve it? Also, may I request, if possible, some examples on TextBlob's documentation website regarding the usage of the algorithms/classifiers available in the nltk.classify package, e.g. Megam, LogisticRegression, SVM, BernoulliNB, GaussianNB etc.? A use case for understanding the applicability of each algorithm would clear up a lot of doubts for beginners like me.
Upvotes: 4
Views: 2022
Reputation: 1941
The Naive Bayes classifier (NBC) is a simple algorithm with low time complexity, and in practice it runs quickly. If you get slow results on a small data set, the slowness is probably coming from somewhere else. I suspect it is the TextBlob object, which is overkill for short texts. Try replacing the NBC with a different algorithm, like a decision tree, to see whether it is really the one to blame.
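For example, here is a minimal sketch of such a comparison using TextBlob's own classifiers (the training sentences below are made up for illustration; this is a sanity check, not a benchmark):

    # Rough timing comparison of two TextBlob classifiers on a tiny,
    # made-up training set.
    import time
    from textblob.classifiers import NaiveBayesClassifier, DecisionTreeClassifier

    train = [
        ("I love this sandwich.", "pos"),
        ("This is an amazing place!", "pos"),
        ("I feel very good about these beers.", "pos"),
        ("I do not like this restaurant.", "neg"),
        ("I am tired of this stuff.", "neg"),
        ("He is my sworn enemy.", "neg"),
    ]

    for cls in (NaiveBayesClassifier, DecisionTreeClassifier):
        start = time.time()
        clf = cls(train)                            # training happens here
        label = clf.classify("sandwich is good")
        print(cls.__name__, label, "%.3fs" % (time.time() - start))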
A classifier should be trained on data that represents the data on which it will be tested. Though sentiment might be similar between movie reviews and your data set, assuming so is unnecessary and a possible source of problems. People sometimes pre-train on another dataset when labeled data is scarce; in that case you should check for domain adaptation issues.
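As an illustration of the pre-training point: TextBlob ships a NaiveBayesAnalyzer that is trained on the NLTK movie_reviews corpus, so its vocabulary comes from film reviews rather than from sentences about food. A minimal sketch (the exact output depends on the corpora installed on your machine):

    # TextBlob's NaiveBayesAnalyzer is pre-trained on the movie_reviews
    # corpus; a food-related sentence is outside that domain.
    from textblob import TextBlob
    from textblob.sentiments import NaiveBayesAnalyzer

    blob = TextBlob("sandwich is good", analyzer=NaiveBayesAnalyzer())
    print(blob.sentiment)   # Sentiment(classification=..., p_pos=..., p_neg=...)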
Usually you train the model once, save it and reuse it. If the data set is likely to change (as in the case of concept drift), retraining is needed. It seems you would benefit from moving from TextBlob to scikit-learn, which also makes saving models easy.
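A rough sketch of that train-once / save / reuse workflow with scikit-learn and pickle (the tiny training set below is made up; joblib is another common choice for persisting scikit-learn models):

    # Train once with scikit-learn, persist with pickle, and reload
    # instead of rebuilding the model on every run.
    import pickle
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts  = ["I love this sandwich.", "This is an amazing place!",
              "I do not like this restaurant.", "I am tired of this stuff."]
    labels = ["pos", "pos", "neg", "neg"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)                    # expensive step, done once

    with open("sentiment_model.pkl", "wb") as f:
        pickle.dump(model, f)

    # later, in another run/process:
    with open("sentiment_model.pkl", "rb") as f:
        model = pickle.load(f)
    print(model.predict(["sandwich is good"]))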
Upvotes: 2