Reputation: 365
I have a dataframe with around 20000 rows and 98 features (all numerical) and a binary target feature with values 0 and 1. The data are balanced: 50% of the rows belong to the first population (target value 1) and 50% to the second (target value 0). As a classification problem, I tried to predict the target value from the data: I implemented a supervised learning algorithm (e.g., SVM) and obtained a very good result, with around 0.95 accuracy. This tells me there is a considerable difference in the features between the two populations. So, as the next step, I want to know which features made this difference, and what the best way is to quantify this difference in the features between the two groups of population. Any ideas?
Upvotes: 0
Views: 452
Reputation: 653
Can you try a KS test on your features? For example, take feature 1 and split it by class; you get two groups. Then test whether the two groups come from different distributions, or just record the p-value.
Once you have all the test results / p-values, build another model using only the features that come from different distributions (very low p-values), and see if the new model is better than or similar to the original.
Not sure if this achieves anything; I wanted to comment but couldn't do so.
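A minimal sketch of the per-feature KS test with scipy, assuming a pandas DataFrame `df` whose label column is called `target` (both names are placeholders for your own data):

```python
import pandas as pd
from scipy.stats import ks_2samp

def ks_by_feature(df, target_col="target"):
    """Run a two-sample KS test on each feature, split by class."""
    group0 = df[df[target_col] == 0]
    group1 = df[df[target_col] == 1]
    p_values = {}
    for col in df.columns.drop(target_col):
        stat, p_value = ks_2samp(group0[col], group1[col])
        p_values[col] = p_value
    # Sort so the most class-separating features (lowest p) come first.
    return pd.Series(p_values).sort_values()

# p_values = ks_by_feature(df)
# selected = p_values[p_values < 0.01].index  # candidate features for the new model
```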
Upvotes: 0
Reputation: 19252
Aside from using the coefficients of the support vectors from your model, you could try building other models.
A decision tree approach will explicitly show you which input features split the data: those nearer the root are more important, for some definition of important (see the sketch below).
If you try a feature reduction technique, like PCA, and rebuild your model, the coefficients of the components will tell you which features contribute most.
Or you could be completely bull-headed, and build lots of models leaving out some features each time, and see which perform better.
Or you could think laterally, and consider what's so different about the few points that your model doesn't accurately classify.
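To illustrate the decision-tree route, a minimal sketch with scikit-learn, where `X` is a placeholder name for your 20000 x 98 feature matrix and `y` for the 0/1 labels:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Shallow tree: the first few splits are the most informative features.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X, y)

# feature_importances_ aggregates how much each feature reduces impurity;
# splits near the root tend to dominate this score.
order = np.argsort(tree.feature_importances_)[::-1]
for idx in order[:10]:
    print(idx, tree.feature_importances_[idx])
```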
Upvotes: 0
Reputation: 5064
To rank your features by importance, you can use Weka, which has a powerful toolkit for feature selection. See this blogpost for more info and examples. By the way, Weka also has an SVM implementation. Once you have identified the important features, you can visualize how different they are between the two classes, e.g. by plotting their distributions per class. Matplotlib has tools like hist or boxplot for this.
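For example, a small sketch of the histogram comparison, again assuming a DataFrame `df` with a binary `target` column and an important feature hypothetically named `feat_1`:

```python
import matplotlib.pyplot as plt

feature = "feat_1"  # placeholder: one of the top-ranked features
plt.hist(df.loc[df["target"] == 0, feature], bins=50, alpha=0.5, label="class 0")
plt.hist(df.loc[df["target"] == 1, feature], bins=50, alpha=0.5, label="class 1")
plt.xlabel(feature)
plt.legend()
plt.show()
```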
If you have an SVM with a linear kernel, you can use its coefficients as direct decision weights for the input features:
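A minimal sketch of this with scikit-learn, using the same placeholder `X` and `y` as above. Standardizing the features first is assumed, so the learned weights are comparable across features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_scaled = StandardScaler().fit_transform(X)

svm = SVC(kernel="linear", C=1.0)
svm.fit(X_scaled, y)

# For a binary problem coef_ has shape (1, n_features); a larger absolute
# weight means a stronger influence on the decision function.
weights = np.abs(svm.coef_[0])
top = np.argsort(weights)[::-1][:10]
print(top)            # indices of the ten most influential features
print(weights[top])   # their absolute weights
```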
Upvotes: 1