Reputation: 89
For years I've been using my own Bayesian-like methods to categorize new items from external sources based on a large and continually updated training dataset.
There are three types of categorization done for each item:
Each item consists of English text of around 2,000 characters. In my training dataset there are about 265,000 items, which contain a rough estimate of 10,000,000 features (unique three word phrases).
My homebrew methods have been fairly successful, but definitely have room for improvement. I've read the NLTK book's chapter "Learning to Classify Text", which was great and gave me a good overview of NLP classification techniques. I'd like to be able to experiment with different methods and parameters until I get the best classification results possible for my data.
What off-the-shelf NLP tools are available that can efficiently classify such a large dataset?
Those I've tried so far:
I tried to train them with a dataset that consisted of less than 1% of the available training data: 1,700 items, 375,000 features. For NLTK I used a sparse binary format, and a similarly compact format for TIMBL.
Both seemed to rely on doing everything in memory, and quickly consumed all system memory. I can get them to work with tiny datasets, but nothing large. I suspect that if I tried incrementally adding the training data the same problem would occur either then or when doing the actual classification.
I've looked at Google's Prediction API, which seem to do much of what I'm looking for but not everything. I'd also like to avoid relying on an external service if possible.
About the choice of features: in testing with my homebrew methods over the years, three word phrases produced by far the best results. Although I could reduce the number of features by using words or two word phrases, that would most likely produce inferior results and would still be a large number of features.
Upvotes: 8
Views: 2372
Reputation: 91
I would recommend Mahout as it is intended for handling very large scale data sets. The ML algorithms are built over Apache Hadoop(map/reduce), so scaling is inherent.
Take a look at classification section below and see if it helps. https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Upvotes: 2
Reputation: 7121
After this post and based on the personal experience, I would recommend Vowpal Wabbit. It is said to have one of the fastest text classification algorithms.
Upvotes: 3
Reputation: 5082
MALLET has a number of classifiers (NB, MaxEnt, CRF, etc). It's written Andrew McCallum's group. SVMLib is another good option, but SVM models typically require a bit more tuning than MaxEnt. Alternatively some sort of online clustering like K-means might not be bad in this case.
SVMLib and MALLET are quite fast (C and Java) once you have your model trained. Model training can take a while though! Unfortunately it's not always easy to find example code. I have some examples of how to use MALLET programmatically (along with the Stanford Parser, which is slow and probably overkill for your purposes). NLTK is a great learning tool and is simple enough that is you can prototype what you are doing there, that's ideal.
NLP is more about features and data quality than which machine learning method you use. 3-grams might be good, but how about character n-grams across those? Ie, all the character ngrams in a 3-gram to account for spelling variations/stemming/etc? Named entities might also be useful, or some sort of lexicon.
Upvotes: 2
Reputation: 75095
Have you tried MALLET?
I can't be sure that it will handle your particular dataset but I've found it to be quite robust in previous tests of mine.
However, I my focus was on topic modeling rather than classification per se.
Also, beware that with many NLP solutions you needn't input the "features" yourself (as the N-grams, i.e. the three-words-phrases and two-word-phrases mentioned in the question) but instead rely on the various NLP functions to produce their own statistical model.
Upvotes: 0