Mark

Reputation: 402

Fine Text Classification - what algorithm?

I'm looking to implement a classifier with approximately 150 categories (probably in Java), mostly for tweets (so very small documents). Some of the classes have very similar domains, e.g. 'Companies', 'Competition', 'Consumers', 'International law', 'International organisations', 'International politics and government'. What algorithm/approach is best when such a fine resolution is needed? I've tried Naive Bayes (obviously) and so far it hasn't performed very well (although that could just be due to the quality of the training data). The community's thoughts would be very welcome!

Thanks,

Mark

Upvotes: 4

Views: 989

Answers (3)

Wesley Baugh

Reputation: 3770

It might be worthwhile to come up with a hierarchical classifier built from (potentially many) levels of sub-classifiers (i.e., come up with a taxonomy for your document labels).

Single classifier

[Figure: single classifier with many possible class labels]

A single classifier can output any of the many possible class labels.

Hierarchical classifier

[Figure: hierarchical classifier]

A hierarchical classifier groups related class labels together, and performs additional layers of classification until a leaf node is reached (or until the confidence drops below a certain threshold).

Intuition

The intuition is that the classifier will have an easier time learning discriminative features when the number of categories is fewer.

For example, a hierarchical classifier may have an easier time learning that "player" is a good feature indicative of sports, whereas a single classifier would have a more difficult time if "player" was only seen for one category (basketball) and not another (hockey).
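To make the routing idea concrete, here is a minimal Java sketch. Everything in it is an assumption for illustration: the class name `HierarchicalClassifier`, the grouping of the question's labels into "International" and "Business", and especially `keywordModel`, which is a toy keyword lookup standing in for whatever trained model (Naive Bayes, SVM, ...) you would actually plug in at each node.

```java
import java.util.Map;
import java.util.function.Function;

class HierarchicalClassifier {
    // Top-level model: text -> coarse group (hypothetical groups).
    private final Function<String, String> top;
    // One sub-model per coarse group: text -> fine-grained label.
    private final Map<String, Function<String, String>> children;

    HierarchicalClassifier(Function<String, String> top,
                           Map<String, Function<String, String>> children) {
        this.top = top;
        this.children = children;
    }

    // Classify coarsely first, then refine within the chosen group.
    String classify(String text) {
        String group = top.apply(text);
        Function<String, String> child = children.get(group);
        // No sub-model for this group: the group itself is the leaf.
        return child == null ? group : child.apply(text);
    }

    // Toy stand-in for a trained model (assumption, not a real learner):
    // returns the label of a keyword found in the text, else a fallback.
    static Function<String, String> keywordModel(Map<String, String> keywordToLabel,
                                                 String fallback) {
        return text -> {
            String lower = text.toLowerCase();
            for (Map.Entry<String, String> e : keywordToLabel.entrySet()) {
                if (lower.contains(e.getKey())) {
                    return e.getValue();
                }
            }
            return fallback;
        };
    }

    // Builds a two-level example using labels from the question.
    static HierarchicalClassifier demo() {
        Function<String, String> top = keywordModel(
            Map.of("international", "International", "company", "Business"),
            "Other");
        Map<String, Function<String, String>> children = Map.of(
            "International", keywordModel(
                Map.of("law", "International law",
                       "organisation", "International organisations"),
                "International politics and government"),
            "Business", keywordModel(
                Map.of("consumer", "Consumers", "competition", "Competition"),
                "Companies"));
        return new HierarchicalClassifier(top, children);
    }

    public static void main(String[] args) {
        HierarchicalClassifier hc = demo();
        System.out.println(hc.classify("New international law on fishing"));
    }
}
```

In a real system each `Function<String, String>` would wrap a model trained only on the documents belonging to that node, so each model has far fewer than 150 labels to separate.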

Upvotes: 5

miguelmalvarez

Reputation: 930

You should try different algorithms, as no single model is known to outperform the rest on every problem. Weka (as suggested by @Sanz) or RapidMiner are good options to try multiple classifiers without too much trouble.

The problem in your case is that tweets carry a very limited amount of information, and the issue is not which method to apply, but how to represent the information. You should try some techniques for knowledge augmentation using tweet data such as the author or the hashtags. Do you have access to this information?
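As a rough illustration of what such augmentation could look like, the sketch below turns a tweet into a list of prefixed features, so that the author and hashtags become signals separate from the ordinary words. The class name `TweetFeatures`, the `author=` / `hashtag=` / `word=` prefixes, and the simple tokenisation rule are all assumptions made for the example, not an established scheme.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TweetFeatures {
    // Matches a '#' followed by word characters (letters, digits, underscore).
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    // Builds a feature list from the tweet's author and text.
    static List<String> extract(String author, String text) {
        List<String> features = new ArrayList<>();

        // The author is a feature of its own: some accounts tweet about one topic.
        features.add("author=" + author.toLowerCase(Locale.ROOT));

        // Hashtags get their own prefix so they are not confused with plain words.
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            features.add("hashtag=" + m.group(1).toLowerCase(Locale.ROOT));
        }

        // Remaining tokens become word features; hashtag tokens are skipped
        // here because they were already emitted above.
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9#@]+")) {
            if (!tok.isEmpty() && !tok.startsWith("#")) {
                features.add("word=" + tok);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(extract("UN_News", "New treaty signed #IntLaw today"));
    }
}
```

The resulting strings can be fed to any bag-of-features classifier; whether the extra features actually help would need to be checked on your own data.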

Considering multi-label methods is also a good option. However, I would focus on data representation and augmentation first.

Regards,

Upvotes: 2

Evans

Reputation: 1599

WEKA

Weka is a tool for experimenting with various machine learning models (Naive Bayes, C4.5, OneR, SVM, k-NN...), and one of the most widely used for data mining. You may want to experiment with different models to see what fits your problem best.

You can call the algorithms from your Java code or use the Weka executable to run them directly on your dataset.

As your categories are similar, you may also want to look at some multi-label classification methods.

Upvotes: 2
