Reputation:
I've got some 120k text files and 12 categories into which I want to classify these documents. I'm using a simple bag-of-words model and feeding it to Naive Bayes. But I was told that using a mixture of features would "help", or rather that I should at least try. For instance:
1.] POS tags + Bigrams,
2.] Bag-of-NER + POS tags
But the problem is: how do I combine these two/three different features into a single feature vector for each document? Secondly, which "feature mixture" works best for document classification?
Upvotes: 0
Views: 256
Reputation: 4172
You can try the following:
For each document, compute, for example, a bag-of-words vector and a bigram vector.
Concatenate the two vectors to get one big sparse vector.
Optionally, apply a dimensionality reduction technique to obtain a low-dimensional embedding in which every feature is a combination of the original features. You can try PCA or LDA (linear discriminant analysis). A sketch of both steps follows below.
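Here is a minimal sketch with scikit-learn, assuming your raw texts are in a list `docs` with labels in `labels` (both placeholder names and toy data). The same FeatureUnion pattern extends to POS-tag or NER count vectors if you precompute those as strings. Note that scikit-learn's PCA does not accept sparse input, so TruncatedSVD is used here as the usual substitute; its output can be negative, which MultinomialNB rejects, so the reduced variant pairs it with logistic regression instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Placeholder corpus and labels; replace with your 120k documents and 12 categories.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "stocks fell sharply on the market news",
    "the team won the championship game",
    "new phone released with a faster chip",
]
labels = [0, 1, 2, 3]

# Steps 1-2: build a separate vectorizer per feature type, then concatenate.
# FeatureUnion horizontally stacks each transformer's sparse output, giving
# one big sparse feature vector per document.
combined = FeatureUnion([
    ("bow", CountVectorizer(ngram_range=(1, 1))),      # bag of words
    ("bigrams", CountVectorizer(ngram_range=(2, 2))),  # word bigrams
])

nb_clf = Pipeline([("features", combined), ("nb", MultinomialNB())])
nb_clf.fit(docs, labels)
print(nb_clf.predict(["a fox and a dog"]))

# Step 3 (optional): reduce the dimensionality of the concatenated matrix.
# TruncatedSVD works directly on sparse matrices; the reduced features may be
# negative, so use a classifier that accepts that (e.g. logistic regression).
svd_clf = Pipeline([
    ("features", combined),
    ("svd", TruncatedSVD(n_components=2)),  # use a few hundred on real data
    ("logreg", LogisticRegression(max_iter=1000)),
])
svd_clf.fit(docs, labels)
```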
Upvotes: 1