user3544665

Reputation: 55

Find the best set of features to separate 2 known groups of data

I would like some perspective on whether what I am doing is right or wrong, or whether there is a better way to do it.

I have 10 000 elements, each with about 500 features.

I want to measure the separability between 2 sets of those elements (I already know the 2 groups; I am not trying to find them). For now I am using an SVM: I train it on 2 000 of the elements, then look at how good the score is when I test on the other 8 000.
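
Roughly, this is what I do now (just a sketch with scikit-learn; X and y are placeholders for my 10 000 x 500 feature matrix and the known group labels):

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # 2 000 elements for training, the other 8 000 for scoring
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=2000, test_size=8000, stratify=y, random_state=0)

    clf = SVC()                                    # the SVM I currently use
    clf.fit(X_train, y_train)
    baseline_score = clf.score(X_test, y_test)     # accuracy on the held-out 8 000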

Now I would like to know which features maximize this separation.

My first approach was to test every combination of features with the SVM and track the resulting score. If the score is good, those features are relevant for separating the 2 sets of data. But this takes far too much time: with 500 features there are 2^500 - 1 possible subsets to try.

My second approach was to remove one feature at a time and see how much the score is affected. If the score changes a lot, that feature is relevant. This is faster, but I am not sure it is right: with 500 features, removing just one rarely changes the final score much.
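
The second approach looks something like this (again only a sketch, reusing the split and baseline score from the sketch above):

    import numpy as np
    from sklearn.svm import SVC

    # Drop one feature at a time and measure how much the held-out score drops.
    n_features = X_train.shape[1]
    drops = []
    for j in range(n_features):
        cols = [c for c in range(n_features) if c != j]
        clf_j = SVC()
        clf_j.fit(X_train[:, cols], y_train)
        drops.append(baseline_score - clf_j.score(X_test[:, cols], y_test))

    # Features whose removal hurts the score the most
    ranking = np.argsort(drops)[::-1]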

Is this a correct way to do it?

Upvotes: 1

Views: 2022

Answers (4)

Bernard

Reputation: 301

Have you thought about Linear Discriminant Analysis (LDA)?

LDA finds a linear combination of features that maximizes class separability. It works by projecting your data into a space where the within-class variance is minimized and the between-class variance is maximized.

You can use it to reduce the number of dimensions needed for classification, and also use it directly as a linear classifier.
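
With scikit-learn that could look roughly like this (just a sketch, not tuned to your data; X_train, y_train, X_test, y_test stand for your train/test split):

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    lda = LinearDiscriminantAnalysis()
    X_train_lda = lda.fit_transform(X_train, y_train)  # with 2 classes this is a single component
    score = lda.score(X_test, y_test)                  # LDA used directly as a linear classifier

    # The weights of the discriminant axis give a hint of which original
    # features contribute most to the separation.
    weights = lda.coef_.ravel()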

However, with this technique you lose the original features and their meaning, which you may want to avoid.

If you want more details, I found this article to be a good introduction.

Upvotes: 1

Bob Dillon

Reputation: 341

You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.

I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).

These functions try to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.

These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.

I've been successful using Information Gain for feature reduction and found this paper (Christine Largeron, Christophe Moulin, Mathias Géry, Entropy Based Feature Selection for Text Categorization, SAC 2011, pp. 924-928) to be a very good practical guide.

Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:

Given a term tj and a category ck, ECCD(tj , ck) can be computed from a contingency table. Let A be the number of documents in the category containing tj ; B, the number of documents in the other categories containing tj ; C, the number of documents of ck which do not contain tj and D, the number of documents in the other categories which do not contain tj (with N = A + B + C + D):

                                  in category ck    in other categories
    documents containing tj             A                    B
    documents not containing tj         C                    D

Using this contingency table, Information Gain can be estimated by:

    IG(tj, ck) = - (A+C)/N * log((A+C)/N) - (B+D)/N * log((B+D)/N)
                 + A/N * log(A/(A+B)) + B/N * log(B/(A+B))
                 + C/N * log(C/(C+D)) + D/N * log(D/(C+D))

This approach is easy to implement and provides very good Information-Theoretic feature reduction.
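
As a rough sketch of how you might compute it (assuming 0/1 class labels and binarizing each feature, e.g. at its median, so the A/B/C/D counts make sense; X and y stand for your data):

    import numpy as np

    def information_gain(x_bin, y):
        """Estimate IG(feature, class) from the A/B/C/D counts described above.
        Assumes x_bin and y are 0/1 arrays."""
        N = len(y)
        A = np.sum((x_bin == 1) & (y == 1))   # in class, feature present
        B = np.sum((x_bin == 1) & (y == 0))   # other class, feature present
        C = np.sum((x_bin == 0) & (y == 1))   # in class, feature absent
        D = np.sum((x_bin == 0) & (y == 0))   # other class, feature absent

        def h(probs):
            """Entropy of a distribution, treating 0*log(0) as 0."""
            p = np.asarray(probs, dtype=float)
            p = p[p > 0]
            return -np.sum(p * np.log2(p))

        def branch(a, b):
            """Weighted class entropy within one branch (feature present/absent)."""
            tot = a + b
            return 0.0 if tot == 0 else (tot / N) * h([a / tot, b / tot])

        H_class = h([(A + C) / N, (B + D) / N])   # H(class)
        H_cond = branch(A, B) + branch(C, D)      # H(class | feature)
        return H_class - H_cond

    # Rank all features: binarize each one at its median, then score it.
    scores = [information_gain((X[:, j] > np.median(X[:, j])).astype(int), y)
              for j in range(X.shape[1])]
    best_features = np.argsort(scores)[::-1]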

You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.

Upvotes: 3

Aditya Patel

Reputation: 569

Have you tried any other method? Maybe you can try a decision tree or a random forest; they can rank your best features based on entropy gain. Can I assume all the features are independent of each other? If not, please remove the dependent ones as well.
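
For example, a rough sketch with scikit-learn (X_train and y_train stand for your training data):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Entropy-based trees; feature_importances_ gives a quick ranking.
    rf = RandomForestClassifier(n_estimators=200, criterion="entropy", random_state=0)
    rf.fit(X_train, y_train)
    top_features = np.argsort(rf.feature_importances_)[::-1][:20]   # 20 is arbitrary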

Also, for support vectors, you can check out this paper:

http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf

But it's based more on linear SVMs.
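
scikit-learn's RFE follows a similar idea, recursively dropping the features with the smallest linear-SVM weights; a rough sketch (the numbers are arbitrary):

    from sklearn.feature_selection import RFE
    from sklearn.svm import LinearSVC

    # Recursively eliminate the features with the smallest linear-SVM weights.
    selector = RFE(LinearSVC(dual=False), n_features_to_select=20, step=10)
    selector.fit(X_train, y_train)
    selected = selector.get_support(indices=True)   # indices of the kept features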

Upvotes: 3

Has QUIT--Anony-Mousse

Reputation: 77485

If you want a single feature to discriminate your data, use a decision tree, and look at the root node.

SVM by design looks at combinations of all features.
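
For example (sketch, using scikit-learn; X_train and y_train stand for your training data):

    from sklearn.tree import DecisionTreeClassifier

    # A depth-1 tree (a stump) splits on the single most discriminating feature.
    stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
    root_feature = stump.tree_.feature[0]   # index of the feature used at the root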

Upvotes: 1
