Reputation: 89

NLP software for classification of large datasets

For years I've been using my own Bayesian-like methods to categorize new items from external sources based on a large and continually updated training dataset.

There are three types of categorization done for each item:

30 categories, where each item must belong to at least one category and at most two.
10 other categories, where each item is only associated with a category if there is a strong match, and each item can belong to as many categories as match.
4 other categories, where each item must belong to only one category, and if there isn't a strong match the item is assigned to a default category.
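
To make the three assignment rules concrete, here is a minimal sketch. It assumes a classifier has already produced one score per category in a dict; the function names and the `margin`, `threshold`, and `default` parameters are hypothetical, chosen only for illustration and not taken from my actual code.

```python
def assign_primary(scores, margin=0.1):
    """Scheme 1: always one category, plus a second if it scores within `margin` of the best."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = ranked[:1]
    if len(ranked) > 1 and scores[ranked[1]] >= scores[ranked[0]] - margin:
        chosen.append(ranked[1])
    return chosen


def assign_tags(scores, threshold=0.9):
    """Scheme 2: every category whose score clears the threshold (possibly none)."""
    return [cat for cat, score in scores.items() if score >= threshold]


def assign_with_default(scores, threshold=0.9, default="other"):
    """Scheme 3: exactly one category, falling back to `default` on a weak match."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else default
```

The thresholds here are placeholders; in practice they would need tuning per category set.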

Each item consists of English text of around 2,000 characters. My training dataset has about 265,000 items, which contain roughly 10,000,000 features (unique three-word phrases).
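
For illustration, the kind of feature meant here can be sketched as below. The tokenization (lowercasing, keeping letters, digits, and apostrophes) is an assumption made for the example, not necessarily what my own code does, and `three_word_phrases` is a hypothetical name.

```python
import re


def three_word_phrases(text):
    """Return the set of unique three-word phrases (word trigrams) in `text`."""
    # Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of three words across the token list.
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}
```

Texts with fewer than three words simply yield an empty set.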

My homebrew methods have been fairly successful, but definitely have room for improvement. I've read the NLTK book's chapter "Learning to Classify Text", which was great and gave me a good overview of NLP classification techniques. I'd like to be able to experiment with different methods and parameters until I get the best classification results possible for my data.

The Question

What off-the-shelf NLP tools are available that can efficiently classify such a large dataset?

Those I've tried so far:

NLTK
TIMBL

I tried to train them with a dataset that consisted of less than 1% of the available training data: 1,700 items, 375,000 features. For NLTK I used a sparse binary format, and a similarly compact format for TIMBL.

Both seemed to rely on doing everything in memory, and quickly consumed all system memory. I can get them to work with tiny datasets, but nothing large. I suspect that if I tried incrementally adding the training data the same problem would occur either then or when doing the actual classification.

I've looked at Google's Prediction API, which seems to do much of what I'm looking for but not everything. I'd also like to avoid relying on an external service if possible.

About the choice of features: in testing with my homebrew methods over the years, three-word phrases produced by far the best results. Although I could reduce the number of features by using single words or two-word phrases, that would most likely produce inferior results and would still leave a large number of features.

Upvotes: 8

Answers (4)

Monis

Reputation: 91

I would recommend Mahout, as it is intended for handling very large-scale data sets. The ML algorithms are built on top of Apache Hadoop (MapReduce), so scaling is inherent.

Take a look at the classification section at the link below and see if it helps: https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Upvotes: 2

Skarab

Reputation: 7121

Based on this post and on personal experience, I would recommend Vowpal Wabbit. It is said to have one of the fastest text classification algorithms.
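
As a rough sketch of what preparing such data for Vowpal Wabbit could look like, the snippet below formats one item as a line in VW's plain-text input format: a label, a `|`, then space-separated feature names. It assumes one integer label per item (numbered from 1) and phrase features supplied as a set of strings; `to_vw_line` is a hypothetical helper, not part of VW. Spaces, `:` and `|` are replaced because they have special meaning in that format.

```python
def to_vw_line(label, phrases):
    """Format one training item as a Vowpal Wabbit input line."""
    tokens = []
    for phrase in sorted(phrases):
        # Feature names may not contain spaces, ':' or '|' in VW's text format.
        token = phrase.replace(" ", "_").replace(":", "_").replace("|", "_")
        tokens.append(token)
    return f"{label} | " + " ".join(tokens)


if __name__ == "__main__":
    print(to_vw_line(3, {"the quick brown", "quick brown fox"}))
```

A file of such lines could then be trained with something along the lines of `vw -d train.vw --oaa 30 -f model.vw`. Because VW hashes feature names into a fixed-size weight table, memory use should not grow with the number of distinct phrases, though the exact options would need checking against the data.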

Upvotes: 3

nflacco

Reputation: 5082

MALLET has a number of classifiers (NB, MaxEnt, CRF, etc.). It's written by Andrew McCallum's group. LIBSVM is another good option, but SVM models typically require a bit more tuning than MaxEnt. Alternatively, some sort of online clustering like K-means might not be bad in this case.

LIBSVM and MALLET are quite fast (C and Java) once you have your model trained. Model training can take a while though! Unfortunately, it's not always easy to find example code. I have some examples of how to use MALLET programmatically (along with the Stanford Parser, which is slow and probably overkill for your purposes). NLTK is a great learning tool and is simple enough that if you can prototype what you are doing there, that's ideal.

NLP is more about features and data quality than which machine learning method you use. 3-grams might be good, but how about character n-grams across those? I.e., all the character n-grams in a 3-gram, to account for spelling variations, stemming, etc.? Named entities might also be useful, or some sort of lexicon.
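
To illustrate the character n-gram idea, here is a minimal sketch. The function name, the padding with spaces, and the default `n=4` are assumptions made for the example.

```python
def char_ngrams(phrase, n=4):
    """Return all character n-grams of a phrase, padded with a space on each side."""
    padded = f" {phrase} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    a = set(char_ngrams("classify large datasets"))
    b = set(char_ngrams("classifying large dataset"))
    # Two word 3-grams that differ only by inflection still share many character n-grams.
    print(len(a & b), "shared of", len(a | b))
```

As whole-phrase features these two 3-grams would not match at all, whereas their character n-grams overlap heavily, which is the point of the suggestion.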

Upvotes: 2

mjv

Reputation: 75095

Have you tried MALLET?

I can't be sure that it will handle your particular dataset but I've found it to be quite robust in previous tests of mine.
However, my focus was on topic modeling rather than classification per se.

Also, be aware that with many NLP solutions you needn't input the "features" yourself (such as the N-grams, i.e. the three-word and two-word phrases mentioned in the question); you can instead rely on the various NLP functions to produce their own statistical model.

Upvotes: 0
