Reputation: 367
I'm trying to get some information on the correlation between the independent variables.
My dataset has a lot of variables, so a heatmap is not a solution: it is unreadable.
Currently, I have a function that returns only the variables that are highly correlated. I would like to change it so that it reports the pairs of correlated features.
The rest of the explanation is below:
def find_correlated_features(df, threshold, target_variable):
    # drop the target column (columns= is needed; drop() targets rows by default)
    df_1 = df.drop(columns=target_variable)
    # corr_matrix has the variable names as both index and columns
    corr_matrix = df_1.corr().abs()
    # keep only the upper triangle so each pair is counted once
    # (np.bool was removed in NumPy 1.24; use the builtin bool)
    half_of_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # collect the columns that are correlated with at least one other column
    to_drop = [column for column in half_of_matrix.columns if any(half_of_matrix[column] > threshold)]
    return to_drop
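To make the upper-triangle step concrete, here is a minimal sketch on a made-up 3x3 correlation matrix. The labels a, b, c and the values are assumptions for illustration only, not data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 correlation matrix, for illustration only
corr = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.4],
     [0.2, 0.4, 1.0]],
    index=list("abc"), columns=list("abc"),
)

# True strictly above the diagonal, False on and below it
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Entries outside the mask become NaN, so each pair appears exactly once
upper = corr.where(mask)
print(upper)
```

Only the pairs (a, b), (a, c) and (b, c) keep their values; the diagonal and the mirrored lower half become NaN.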
Ideally, this function would return a pandas DataFrame with the columns column_1, column_2 and corr_coef, containing only the pairs that are above the threshold.
Something like this:
output = {'feature name 1': column_name,
          'feature name 2': index,
          'corr_coef': corr_coef}
output_list.append(output)
return pd.DataFrame(output_list).sort_values('corr_coef', ascending=False)
Upvotes: 2
Views: 1962
Reputation: 1253
This should match the output you're looking for:
import pandas as pd
import numpy as np
# Create fake correlation matrix
corr_matrix = np.random.random_sample((5, 5))
ii, jj = np.triu_indices(corr_matrix.shape[0], 1)
scores = []
for i, j in zip(ii, jj):
    scores.append((i, j, corr_matrix[i, j]))
df_out = pd.DataFrame(data=scores, columns=['feature name 1', 'feature name 2', 'corr_coef'])\
    .sort_values('corr_coef', ascending=False)\
    .reset_index(drop=True)
threshold = 0.1
df_out[df_out['corr_coef'] > threshold]
#    feature name 1  feature name 2  corr_coef
# 0               0               2   0.990691
# 1               2               4   0.990444
# 2               0               1   0.830640
# 3               1               2   0.623895
# 4               1               4   0.433258
# 5               3               4   0.404395
# 6               0               4   0.291564
# 7               2               3   0.276799
# 8               1               3   0.177519
You can then map the feature indices (in the columns feature name 1 and feature name 2 above) to the columns of your df_1 to get the actual feature names.
So your complete function would look like this:
def find_correlated_features(df, threshold, target_variable):
    # drop the target column (columns= is needed; drop() targets rows by default)
    df_1 = df.drop(columns=target_variable)
    # corr has the variable names as both index and columns
    corr = df_1.corr().abs()
    corr_matrix = corr.to_numpy()
    ii, jj = np.triu_indices(corr_matrix.shape[0], 1)
    scores = []
    for i, j in zip(ii, jj):
        scores.append((i, j, corr_matrix[i, j]))
    df_out = pd.DataFrame(data=scores, columns=['feature name 1', 'feature name 2', 'corr_coef'])\
        .sort_values('corr_coef', ascending=False)\
        .reset_index(drop=True)
    # i and j are positions in the correlation matrix, so they map
    # directly onto its column labels (no offset is needed)
    feature_name_map = dict(enumerate(corr.columns))
    df_out['feature name 1'] = df_out['feature name 1'].map(feature_name_map)
    df_out['feature name 2'] = df_out['feature name 2'].map(feature_name_map)
    return df_out[df_out['corr_coef'] > threshold]
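The Python loop and the index-to-name mapping can probably be folded into a few array operations, since the positions from np.triu_indices can index the column labels directly. The following is only a sketch of that idea: the helper name correlated_pairs and the demo data are made up for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold):
    """Return feature pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Row/column positions of the strict upper triangle
    ii, jj = np.triu_indices(len(corr.columns), k=1)
    out = pd.DataFrame({
        "feature name 1": corr.columns[ii],   # labels picked by position
        "feature name 2": corr.columns[jj],
        "corr_coef": corr.to_numpy()[ii, jj],
    })
    out = out[out["corr_coef"] > threshold]
    return out.sort_values("corr_coef", ascending=False).reset_index(drop=True)

# Made-up data: b is an exact multiple of a, c is uncorrelated with both
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, -1.0, 1.0],
})
pairs = correlated_pairs(demo, 0.9)
print(pairs)
```

With this demo data only the pair (a, b) passes the 0.9 threshold. The target column is assumed to have been dropped before the call.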
Upvotes: 2
Reputation: 4929
After Edit:
After the OP's comment and @user6386471's answer, I reread the question, and I think simply restructuring the correlation matrix would work, with no need for loops: something like half_of_matrix.stack().reset_index() plus filters. See:
def find_correlated_features(df, threshold, target_variable):
    # remove target column
    df = df.drop(columns=target_variable).copy()
    # Get correlation matrix
    corr_matrix = df.corr().abs()
    # Take half of the matrix to prevent doubling results
    # (np.bool was removed in NumPy 1.24; use the builtin bool)
    corr_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # Restructure correlation matrix to dataframe
    df = corr_matrix.stack().reset_index()
    df.columns = ['feature1', 'feature2', 'corr_coef']
    # Apply filter and sort coefficients
    df = df[df.corr_coef >= threshold].sort_values('corr_coef', ascending=False)
    return df
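As a quick check of the stack-based approach, here is a small self-contained sketch. The function body restates the approach above with slightly different local names, and the demo columns x, x_scaled, noise and y are invented for illustration.

```python
import numpy as np
import pandas as pd

def find_correlated_features(df, threshold, target_variable):
    features = df.drop(columns=target_variable)
    corr_matrix = features.corr().abs()
    # Keep the strict upper triangle; everything else becomes NaN
    mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    pairs = corr_matrix.where(mask).stack().reset_index()
    pairs.columns = ["feature1", "feature2", "corr_coef"]
    # NaN never satisfies >=, so masked entries are filtered out here
    # even if stack() happens to keep them
    return pairs[pairs.corr_coef >= threshold].sort_values("corr_coef", ascending=False)

# Made-up data: x_scaled is 10 * x, noise is uncorrelated with both
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "x_scaled": [10.0, 20.0, 30.0, 40.0],
    "noise": [1.0, -1.0, -1.0, 1.0],
    "y": [0, 1, 0, 1],
})
result = find_correlated_features(demo, 0.9, "y")
print(result)
```

Only the pair (x, x_scaled) should survive the 0.9 threshold, and the target y never appears because it is dropped first.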
Original answer:
You can easily create a Series with the coefficients above a threshold like this:
s = df.corr().loc[target_col]
s[s.abs() >= threshold]
where df is your dataframe, target_col is your target column, and threshold is, you know, the threshold.
Example:
import pandas as pd
import seaborn as sns
df = sns.load_dataset('iris')
print(df.shape)
# -> (150, 5)
print(df.head())
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
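def find_correlated_features(df, threshold, target_variable):
    # numeric_only=True skips the non-numeric 'species' column
    # (needed on pandas 2.0+, where corr() no longer drops it silently)
    s = df.corr(numeric_only=True).loc[target_variable].drop(target_variable)
    return s[s.abs() >= threshold]
find_correlated_features(df, .7, 'sepal_length')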
output:
petal_length 0.871754
petal_width 0.817941
Name: sepal_length, dtype: float64
You can apply .to_frame() followed by .T to the output to get a pandas dataframe.
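A minimal sketch of that conversion, using a hand-built Series with the same values as the output above rather than the real iris result:

```python
import pandas as pd

# Hand-built Series mirroring the output shown above
s = pd.Series(
    {"petal_length": 0.871754, "petal_width": 0.817941},
    name="sepal_length",
)

as_column = s.to_frame()   # one column named "sepal_length", features as rows
as_row = s.to_frame().T    # one row named "sepal_length", features as columns
print(as_row)
```

Whether the transposed layout is preferable depends on how the result is consumed; .to_frame() alone already gives a DataFrame.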
Upvotes: 2