Reputation: 1867
I need to develop a custom text categorization solution that does not use the input text itself as a set of features, but rather some derived parameters, e.g. the number of URLs in the text, the number of words for each part of speech, average word length, etc. (let's assume we are able to derive a set of features from an input document).
Originally I thought about using OpenNLP to do the categorization for me (via DocumentCategorizerME), but as far as I can see it uses only text strings as features, and it is not possible to use non-discrete features (e.g. a floating-point number that represents average word length).
So the questions are:
Upvotes: 0
Views: 396
Reputation: 41
If you showed up from Google like me, you may notice that OpenNLP has an extraInformation parameter in the classify method. Unfortunately, it's not used at all :(
This means that the suggestion Renaud gave may be the best alternative.
Alternatively, if you must use OpenNLP, you could add new features by inserting an extra pseudo-word into the data (in both training and prediction), such as XAverageWordLengthX. I'm not saying it's a great solution, but it could help your algorithm.
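Here is a minimal sketch of that pseudo-word trick in plain Java, with no OpenNLP calls. Since a token is discrete, a floating-point value like average word length has to be bucketed first. The class name, the bin edges, the cap on the URL count, and the `_<bin>` suffix on the token are all my own assumptions for illustration; pick values that fit your data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PseudoWordFeatures {

    // Hypothetical bin edges for average word length; tune these on your own corpus.
    private static final double[] AVG_WORD_LENGTH_EDGES = {4.0, 5.0, 6.0};

    // Returns 0 if value is below the first edge, and edges.length if it is
    // at or above the last edge.
    public static int bin(double value, double[] edges) {
        int i = 0;
        while (i < edges.length && value >= edges[i]) {
            i++;
        }
        return i;
    }

    // Appends one pseudo-word per derived feature to the document's tokens.
    public static String[] augment(String[] tokens, int urlCount, double avgWordLength) {
        List<String> out = new ArrayList<>(Arrays.asList(tokens));
        // Cap the count so that "3" means "3 or more" (an arbitrary choice).
        out.add("XUrlCount_" + Math.min(urlCount, 3) + "X");
        out.add("XAverageWordLength_" + bin(avgWordLength, AVG_WORD_LENGTH_EDGES) + "X");
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] tokens = {"see", "http://example.com", "for", "details"};
        System.out.println(String.join(" ", augment(tokens, 1, 5.3)));
    }
}
```

The augmented token array is what you would hand to the categorizer instead of the raw tokens. The same `augment` step has to run on every training sample and on every document at prediction time, otherwise the model never sees the pseudo-words consistently.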
Upvotes: 0