user4069366

How to combine and feed different features to an algorithm for text classification

I've got about 120k text files and 12 categories into which I want to classify them. I'm using a simple bag-of-words model and feeding it to Naive Bayes. I was told that a mixture of features might help, or at least that I should try one. For instance:

1. POS tags + bigrams
2. Bag-of-NER + POS tags

The problem is: how do I combine these two or three different feature types into a single feature vector for each document? Secondly, which feature mixture works best for document classification?

Upvotes: 0

Views: 256

Answers (1)

Farseer

Reputation: 4172

You can try the following:

For each document, compute, for example, a bag-of-words vector and a bigram vector.

Concatenate the two vectors to get one large sparse vector.

Apply a dimensionality reduction technique to find a low-dimensional embedding in which every feature is a combination of the original features. You can try PCA or LDA (linear discriminant analysis).
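The concatenation step can be sketched in plain Python as follows. This is only a minimal illustration, not a definitive implementation: the toy documents, the whitespace tokenization, and the helper names (`build_vocab`, `vectorize`, `concatenate`) are all assumptions made for the example. Each feature type gets its own vocabulary, and the second block of columns is shifted by the size of the first so the two never collide.

```python
from collections import Counter

def unigrams(tokens):
    # Bag-of-words features: the tokens themselves.
    return list(tokens)

def bigrams(tokens):
    # Adjacent token pairs, joined into a single feature name.
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def build_vocab(docs, extractor):
    """Map each distinct feature produced by `extractor` to a column index."""
    vocab = {}
    for tokens in docs:
        for feat in extractor(tokens):
            vocab.setdefault(feat, len(vocab))
    return vocab

def vectorize(tokens, extractor, vocab):
    """Sparse count vector {column: count}; features not in vocab are ignored."""
    counts = Counter(extractor(tokens))
    return {vocab[f]: c for f, c in counts.items() if f in vocab}

def concatenate(vec_a, dim_a, vec_b):
    """Append vec_b after the dim_a columns of vec_a by offsetting its indices."""
    combined = dict(vec_a)
    combined.update({dim_a + col: c for col, c in vec_b.items()})
    return combined

# Toy corpus (hypothetical); real documents would need proper tokenization.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

uni_vocab = build_vocab(docs, unigrams)
bi_vocab = build_vocab(docs, bigrams)
total_dim = len(uni_vocab) + len(bi_vocab)

combined = [
    concatenate(
        vectorize(d, unigrams, uni_vocab),
        len(uni_vocab),
        vectorize(d, bigrams, bi_vocab),
    )
    for d in docs
]
```

The same offsetting idea extends to a third feature type such as POS-tag counts. In practice, libraries handle this for you: if scikit-learn is an option, `FeatureUnion` or `scipy.sparse.hstack` perform the same column-wise concatenation on sparse matrices, and the combined matrix can then be passed to the dimensionality reduction step.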

Upvotes: 1
