Reputation: 137
I want to calculate, in Python, the correlation between each of my features (all of float type) and the class label (binary, 0 or 1). In addition, I would like to plot the data to visualize each feature's distribution by class.
I need this to identify features that are strongly associated with a single label and to gauge their real importance. Note that I do not want the pairwise correlation between features, and that my classifier is binary.
I have tried the following (from a similar post on Stack Overflow), but it is not exactly what I am looking for.
df.drop("Target", axis=1).apply(lambda x: x.corr(df.Target))
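For reference, a minimal sketch of what this one-liner computes: the Pearson correlation of each feature column with the label, which for a 0/1 label is the point-biserial correlation. The data below is synthetic and the column names `informative` and `noise` are hypothetical stand-ins, not the real features.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical): two float features and a binary label.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "informative": target + rng.normal(0.0, 0.5, size=n),  # shifts with the label
    "noise": rng.normal(0.0, 1.0, size=n),                 # unrelated to the label
    "Target": target,
})

features = df.drop("Target", axis=1)

# Pearson correlation of each feature with the 0/1 label (the line from the post).
per_feature = features.apply(lambda col: col.corr(df["Target"]))

# The same computation written with the built-in column-wise helper.
same = features.corrwith(df["Target"])

# Rank features by the absolute strength of their association with the label.
print(per_feature.abs().sort_values(ascending=False))
```

This returns one value per feature rather than a full feature-by-feature matrix, so no pairwise feature correlation is involved.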
Please see in the attached picture what the distribution looks like for one of the features (from Weka).
Class distribution for one of the features
Any feedback is really appreciated.
Upvotes: 4
Views: 7745
Reputation: 16966
Correlation is not supposed to be used for categorical variables. For more explanation see here
You can explore the relationship between your independent variables and the target variable with a pair plot colored by class:
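import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer

# Load the example data and keep only the first five features so the plot stays readable
data = load_breast_cancer(return_X_y=False)
df = pd.DataFrame(data.data[:, :5], columns=data.feature_names[:5])
# Cast the label to str so seaborn treats it as a categorical hue
df['target'] = data.target.astype(str)

# Scatter plots for every feature pair, with per-class histograms on the diagonal
g = sns.pairplot(df, hue='target', diag_kind='hist',
                 vars=df.columns[:-1],
                 plot_kws=dict(alpha=0.5),
                 diag_kws=dict(alpha=0.5))
plt.show()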
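As a numeric complement to the plot, here is a rough sketch that summarizes each feature by class on the same five columns. The `gap` score (absolute difference of class means divided by the overall standard deviation) is only an illustrative assumption for ranking features, not a standard named metric.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data[:, :5], columns=data.feature_names[:5])
df["target"] = data.target  # keep the label numeric (0 or 1) for grouping

# Mean and standard deviation of every feature within each class.
summary = df.groupby("target").agg(["mean", "std"])

# Illustrative separation score: gap between class means in units of the
# feature's overall standard deviation (larger means better separated).
means = df.groupby("target").mean()
gap = (means.loc[1] - means.loc[0]).abs() / df.drop(columns="target").std()

print(summary.T)
print(gap.sort_values(ascending=False))
```

Features with a large gap should be the ones whose per-class histograms overlap least on the diagonal of the pair plot.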
Upvotes: 8