Reputation: 1745
I want to text classification based on the keywords appear in the text, because I do not have sample data to use naive bayes for text classification.
Example:
my document has some few words as "family, mother , father , children ... " that the categories of document are family.Or "football, tennis, score ... " that the category is sport
What is the best algorithm in this case ?.And is there any api java for this problem?
Upvotes: 2
Views: 4063
Reputation: 77454
You likely will need an auxillary data set for this. You cannot rely on your data set to convey the information that "dad" and "father" and "husband" have a similar meaning.
You can try to do mine for co-occurrences to detect near-synonyms, but this is not very reliable.
Probably wordnet etc. are a good place to disambiguate such words.
Upvotes: 1
Reputation: 12592
You can download the freebase topic collection: http://wiki.freebase.com/wiki/Topic_API.
Upvotes: 0
Reputation: 151
What you have are feature labels, i.e., labels on features rather than instances. There are a few methods for exploiting these, but usually it is assumed that one has instance labels (i.e., labels on documents) in addition to feature labels. This paradigm is referred to as dual-supervision.
Anyway, I know of at least two ways to learn from labeled features alone. The first is Generalized Expectation Criteria, which penalizes model parameters for diverging from a priori beliefs (e.g., that "moether" ought usually to correlate with "family"). This method has the disadvantage of being somewhat complex, but the advantage of having a nicely packaged, open-source Java implementation in the Mallet toolkit (see here, specifically).
A second option would basically be to use Naive Bayes and give large priors to the known word/class associations -- e.g., P("family"|"mother") = .8, or whatever. All unlabeled words would be assigned some prior, presumably reflecting class distribution. You would then effectively being making decisions only based on the prevalence of classes and the labeled term information. Settles proposed a model like this recently, and there is a web-tool available.
Upvotes: 2