Alex Z

Reputation: 1867

Text categorization based on custom features

I need to develop a custom text categorization solution that does not use the input text itself as the set of features, but rather some derived parameters, e.g. the number of URLs in the text, the number of words per part of speech, the average word length, etc. (let's assume we are able to derive the set of features for a given input document).

Originally I thought about using OpenNLP to do the categorization for me (via DocumentCategorizerME), but as far as I can see it accepts only text strings as features, and it is not possible to use non-discrete features (e.g. a floating-point number that represents the average word length).

So the questions are:

  1. Am I missing something? Is it actually possible to adapt OpenNLP to use integer or floating-point features for categorization?
  2. If not, what library / toolkit would you suggest I use instead?

Upvotes: 0

Views: 396

Answers (2)

Quackquack

Reputation: 41

If you showed up from Google like me, you may notice that OpenNLP's document categorizer has an extraInformation parameter in its categorize method. Unfortunately, it's not used at all :(

This means that Renaud's suggestion may be the best alternative.

Alternatively, if you must use OpenNLP, you could include new features by adding an extra pseudo-word to the data (both in training and in prediction), such as XAverageWordLengthX. I'm not saying it's a great solution, but it could help your algorithm.
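
To make the pseudo-word idea concrete, here is a minimal sketch in plain Java. It rounds a numeric feature down into a bucket and appends the bucket as an extra token, so the categorizer sees it as just another word. The class name, the method names, the `X...X` token format and the bucket width of 1.0 are all assumptions made for illustration; none of this is OpenNLP API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
final class PseudoTokens {

    // Turn a numeric value into a discrete token, e.g. 4.6 -> "XAvgWordLen_4X"
    // when bucketWidth is 1.0. Nearby values share a token, so the model can
    // learn from them as it would from any other word.
    static String bucketToken(String name, double value, double bucketWidth) {
        long bucket = (long) Math.floor(value / bucketWidth);
        return "X" + name + "_" + bucket + "X";
    }

    // Mean token length; 0.0 for an empty document.
    static double averageWordLength(String[] tokens) {
        if (tokens.length == 0) {
            return 0.0;
        }
        int total = 0;
        for (String t : tokens) {
            total += t.length();
        }
        return (double) total / tokens.length;
    }

    // Returns the original tokens plus one pseudo-token for the average word length.
    static String[] withPseudoTokens(String[] tokens) {
        List<String> out = new ArrayList<>(Arrays.asList(tokens));
        out.add(bucketToken("AvgWordLen", averageWordLength(tokens), 1.0));
        return out.toArray(new String[0]);
    }
}
```

The same `withPseudoTokens` step has to be applied to the training documents and to every document you categorize later, otherwise the model never sees the extra token. Bucketing throws away precision, which is one reason this is only a workaround.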

Upvotes: 0

Renaud

Reputation: 16501

You should try Mallet to train your own classifier with your own features. Here is a tutorial to get you started.
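
Whichever toolkit you end up with, each document first has to be reduced to a vector of numbers. A rough sketch of that step in plain Java, using two of the features from the question (URL count and average word length), might look like this. It deliberately does not touch Mallet's API: the class name, the method names and the simple `https?://` URL pattern are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical feature extractor, independent of any NLP library.
final class CustomFeatures {

    // Simplistic URL matcher (assumption): "http://" or "https://" followed by non-space characters.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // Number of URL-like substrings in the text.
    static int countUrls(String text) {
        Matcher m = URL.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Mean length of whitespace-separated words; 0.0 for blank text.
    static double averageWordLength(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] words = trimmed.split("\\s+");
        int total = 0;
        for (String w : words) {
            total += w.length();
        }
        return (double) total / words.length;
    }

    // Feature vector layout: [urlCount, averageWordLength].
    static double[] extract(String text) {
        return new double[] { countUrls(text), averageWordLength(text) };
    }
}
```

Each `double[]` returned by `extract` can then be paired with a label and converted into whatever instance type the classifier library expects.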

Upvotes: 1
