Reputation: 53
I want to classify a collection of texts into two classes; say, for sentiment classification. I have two pre-made sentiment dictionaries, one containing only positive words and the other containing only negative words. I would like to incorporate these dictionaries into the feature vectors for an SVM classifier. My question is: is it possible to keep the positive and negative word dictionaries separate when representing them as SVM feature vectors, especially when I generate feature vectors for the test set?
If my explanation is not clear enough, let me give an example. Let's say I have these two sentences as training data:
Pos: The book is good
Neg: The book is bad
The word 'good' exists in the positive dictionary and 'bad' exists in the negative dictionary, while the other words exist in neither dictionary. I want words found in the dictionary that matches the sentence's class to have a large weight, and all other words to have a small weight. So the feature vectors would look like this:
+1 1:0.1 2:0.1 3:0.1 4:0.9
-1 1:0.1 2:0.1 3:0.1 5:0.9
If I want to classify a test sentence, "The food is bad", how should I generate its feature vector with dictionary-dependent weights when I cannot match the test sentence's class against either dictionary? What I can think of is that, for the test set, as long as a word exists in either dictionary, I give it a high weight:
0 1:0.1 3:0.1 5:0.9
I wonder if this is the right way to create vector representations for both the training set and the test set.
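The scheme described above can be sketched as follows. This is only an illustration, assuming whitespace tokenization, lowercasing, feature indices assigned in order of first appearance in the training data, and test words unseen in training being dropped. The names `POSITIVE`, `NEGATIVE`, `build_vocab`, and `to_libsvm` are hypothetical, and the two sets stand in for the real dictionaries.

```python
# Hypothetical stand-ins for the two pre-made dictionaries.
POSITIVE = {"good"}
NEGATIVE = {"bad"}


def build_vocab(train_sentences):
    """Map each training word to a 1-based index, in order of first appearance."""
    vocab = {}
    for sentence in train_sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def to_libsvm(sentence, label, vocab, high=0.9, low=0.1):
    """Build one sparse 'label idx:value ...' line.

    label is +1, -1, or 0 (unknown class, i.e. a test sentence).
    """
    if label > 0:
        boosted = POSITIVE             # training: only the matching dictionary
    elif label < 0:
        boosted = NEGATIVE
    else:
        boosted = POSITIVE | NEGATIVE  # test: a word in either dictionary
    feats = {}
    for word in sentence.lower().split():
        if word in vocab:  # words unseen in training are dropped
            feats[vocab[word]] = high if word in boosted else low
    body = " ".join(f"{idx}:{val}" for idx, val in sorted(feats.items()))
    tag = f"{label:+d}" if label else "0"
    return f"{tag} {body}"


vocab = build_vocab(["The book is good", "The book is bad"])
print(to_libsvm("The book is good", +1, vocab))
print(to_libsvm("The book is bad", -1, vocab))
print(to_libsvm("The food is bad", 0, vocab))
```

Note that the training branch needs the class label to pick a dictionary, which is exactly the information that is missing for the test sentence.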
--Edit-- I forgot to mention that these pre-made dictionaries were extracted using some kind of topic model. For example, the top 100 words from topic 1 roughly represent the positive class, and the words in topic 2 represent the negative class. I want to use this information to improve the classifier beyond what bag-of-words features alone can achieve.
Upvotes: 1
Views: 544
Reputation: 66775
In short - this is not the way it works.
The whole point of learning is to give the classifier the ability to assign these weights on its own. You cannot "force" it to have a high value per class for a particular feature (I mean, you could at the optimization level, but this would require changing the whole SVM structure). Feature values also cannot depend on the class label, because the label is unknown at test time.
So the right way is to simply create a "normal" representation, without any additional specification. Let the model decide; models are better at statistical analysis than human intuition, really.
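As a minimal sketch of such a "normal" representation, the snippet below builds plain bag-of-words term counts in the same sparse `index:value` format used in the question. It assumes whitespace tokenization and lowercasing, and it drops test words that never appeared in training. The helper names `build_vocab` and `bow_libsvm` are hypothetical. The key point is that the same function produces training and test vectors, and the label is only written out, never used to compute a feature value.

```python
from collections import Counter


def build_vocab(train_sentences):
    """Map each training word to a 1-based index, in order of first appearance."""
    vocab = {}
    for sentence in train_sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def bow_libsvm(sentence, vocab, label=0):
    """Term-count features; the label never influences the feature values."""
    counts = Counter(
        vocab[word] for word in sentence.lower().split() if word in vocab
    )
    body = " ".join(f"{idx}:{n}" for idx, n in sorted(counts.items()))
    tag = f"{label:+d}" if label else "0"  # 0 marks an unlabeled test sentence
    return f"{tag} {body}"


train = [("The book is good", +1), ("The book is bad", -1)]
vocab = build_vocab(text for text, _ in train)

for text, label in train:
    print(bow_libsvm(text, vocab, label))
print(bow_libsvm("The food is bad", vocab))
```

With vectors like these, the SVM learns from the labeled data that feature 4 ('good') pulls toward the positive class and feature 5 ('bad') toward the negative one.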
Upvotes: 1