Reputation: 402
I'm looking to implement a classifier with approximately 150 categories (probably in Java), mostly for tweets (so very small documents). Some of the classes have very similar domains, e.g. 'Companies', 'Competition', 'Consumers', 'International law', 'International organisations', 'International politics and government'. What algorithm/approach is best when such a high resolution is needed? I've tried Naive Bayes (obv) and so far it hasn't performed very well (although that could just be due to the quality of the training data). The community's thoughts would be very welcome!
Thanks,
Mark
Upvotes: 4
Views: 989
Reputation: 3770
It might be worthwhile to come up with a hierarchical classifier built from (potentially many) levels of sub-classifiers (i.e., come up with a taxonomy for your document labels).
A single classifier can output any of the many possible class labels.
A hierarchical classifier groups related class labels together, and performs additional layers of classification until a leaf node is reached (or until the confidence drops below a certain threshold).
The intuition is that the classifier will have an easier time learning discriminative features when the number of categories is fewer.
For example, a hierarchical classifier may have an easier time learning that 'player' is a good feature indicative of sports, whereas a single classifier would have a more difficult time if 'player' was only seen for one category (basketball) and not another (hockey).
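A minimal sketch of the two-level idea in Java, assuming a toy taxonomy and using plain keyword overlap as a stand-in for trained sub-classifiers. The class name `HierarchicalTweetClassifier`, the groups, the leaves, and the keyword lists are all hypothetical; in practice each level would be a model trained on labelled tweets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy two-level classifier: choose a coarse group first, then a leaf inside that group. */
final class HierarchicalTweetClassifier {
    // Hypothetical taxonomy: keywords for each group, and keywords for each leaf within a group.
    private final Map<String, Set<String>> groupKeywords = new LinkedHashMap<>();
    private final Map<String, Map<String, Set<String>>> leafKeywords = new LinkedHashMap<>();

    void addGroup(String group, String... keywords) {
        groupKeywords.put(group, new HashSet<>(Arrays.asList(keywords)));
        leafKeywords.putIfAbsent(group, new LinkedHashMap<>());
    }

    void addLeaf(String group, String leaf, String... keywords) {
        leafKeywords.get(group).put(leaf, new HashSet<>(Arrays.asList(keywords)));
    }

    private static Set<String> tokenize(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
    }

    // Keyword overlap stands in for a real classifier's score.
    private static int overlap(Set<String> tokens, Set<String> keywords) {
        int hits = 0;
        for (String k : keywords) {
            if (tokens.contains(k)) hits++;
        }
        return hits;
    }

    // Returns the key with the highest positive score, or null if nothing matches.
    private static String best(Set<String> tokens, Map<String, Set<String>> candidates) {
        String bestKey = null;
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> e : candidates.entrySet()) {
            int score = overlap(tokens, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                bestKey = e.getKey();
            }
        }
        return bestKey;
    }

    /** Returns "group/leaf", just "group" if no leaf matches, or "unknown" if no group matches. */
    String classify(String text) {
        Set<String> tokens = tokenize(text);
        String group = best(tokens, groupKeywords);
        if (group == null) return "unknown";
        String leaf = best(tokens, leafKeywords.get(group));
        // Stop at the internal node when the second level has no evidence.
        return leaf == null ? group : group + "/" + leaf;
    }

    /** Builds the hypothetical demo taxonomy used in the text. */
    static HierarchicalTweetClassifier demo() {
        HierarchicalTweetClassifier c = new HierarchicalTweetClassifier();
        c.addGroup("sports", "player", "game", "team");
        c.addLeaf("sports", "basketball", "dunk", "nba");
        c.addLeaf("sports", "hockey", "puck", "nhl");
        c.addGroup("international", "international", "treaty", "summit");
        c.addLeaf("international", "law", "court", "ruling");
        c.addLeaf("international", "organisations", "un", "nato");
        return c;
    }

    public static void main(String[] args) {
        HierarchicalTweetClassifier c = demo();
        System.out.println(c.classify("Great player, what a dunk in the NBA finals"));
        System.out.println(c.classify("That player was amazing tonight"));
    }
}
```

Here 'player' is learned once at the sports level and shared by both leaves. When the second level finds no evidence, the sketch returns the internal node instead of guessing a leaf, which mirrors the confidence-threshold stopping rule described above.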
Upvotes: 5
Reputation: 930
You should try different algorithms, as no single model is known to outperform the rest on every problem. Weka (as suggested by @Sanz) or RapidMiner are good options for trying multiple classifiers without too much trouble.
The problem in your case is that tweets carry a very limited amount of information, and the issue is not which method to apply, but how to represent the information. You should try some techniques for knowledge augmentation using tweet data such as the author or the hashtags. Do you have access to this information?
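As a rough illustration of enriching the representation, the sketch below turns a tweet plus its author into typed features, so that hashtags, mentions, and the author become separate signals alongside plain words. The class name `TweetFeatures`, the feature prefixes, and the sample tweet are assumptions made for this example, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Builds typed string features from a tweet's text and author (hypothetical scheme). */
final class TweetFeatures {
    // A token is an optional '#' or '@' followed by letters, digits or underscores.
    private static final Pattern TOKEN = Pattern.compile("[#@]?[a-z0-9_]+");

    static List<String> extract(String author, String text) {
        List<String> features = new ArrayList<>();
        features.add("author=" + author.toLowerCase(Locale.ROOT));
        Matcher m = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            String token = m.group();
            if (token.startsWith("#")) {
                // Keep the hashtag as its own feature and also as an ordinary word.
                features.add("hashtag=" + token.substring(1));
                features.add("word=" + token.substring(1));
            } else if (token.startsWith("@")) {
                features.add("mention=" + token.substring(1));
            } else {
                features.add("word=" + token);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Sample author and text are made up for illustration.
        System.out.println(extract("UN_News", "New #treaty signed with @NATO today"));
    }
}
```

The resulting feature strings can be fed to whichever classifier is being evaluated. Keeping the prefixes distinct lets the model weight a hashtag differently from the same word appearing in running text.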
Considering multi-label methods is also a good option. However, I would focus on data representation and augmentation first.
Regards,
Upvotes: 2
Reputation: 1599
Weka is a tool for experimenting with various machine learning models (Naive Bayes, C4.5, OneR, SVM, k-NN...), and one of the most widely used for data mining. You may want to experiment with different models to see which fits your problem best.
You can call the algorithms from your Java code or use their executable to run them directly on your dataset.
As your categories are similar, you may also want to look at some multi-label classification methods.
Upvotes: 2