How to print correlated features using python pandas?

Question

I'm trying to get some information on the correlation of the independent variables.

My dataset has a lot of variables, so a heatmap is not a solution: it is unreadable.

Currently, I have a function that returns only those variables that are highly correlated. I would like to change it so that it reports the pairs of correlated features.

The rest of the explanations below:

import numpy as np
import pandas as pd

def find_correlated_features(df, threshold, target_variable):

    # drop the target column (columns= is needed, otherwise drop looks in the row index)
    df_1 = df.drop(columns=target_variable)

    # corr_matrix has the variable names in both its index and its columns
    corr_matrix = df_1.corr().abs()

    # I'm taking only half of this matrix to prevent doubling results
    half_of_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # This collects the columns which are correlated with at least one other column
    to_drop = [column for column in half_of_matrix.columns if any(half_of_matrix[column] > threshold)]

    return to_drop

Ideally, this function would return a pandas DataFrame with the columns column_1, column_2 and corr_coef, containing only the pairs whose correlation is above the threshold.

Something like this:

output = {'feature name 1': column_name,
          'feature name 2': index,
          'corr_coef': corr_coef}

output_list.append(output)
return pd.DataFrame(output_list).sort_values('corr_coef', ascending=False)

Cainã Max Couto da Silva · Accepted Answer

After Edit:

After the OP's comment and @user6386471's answer, I read the question again, and I think that simply restructuring the correlation matrix would work, with no need for loops: something like half_of_matrix.stack().reset_index() plus a filter. See:

def find_correlated_features(df, threshold, target_variable):
    # remove target column
    df = df.drop(columns=target_variable).copy()
    # Get correlation matrix (numeric_only=True skips non-numeric columns in recent pandas versions)
    corr_matrix = df.corr(numeric_only=True).abs()
    # Take half of the matrix to prevent doubling results (k=1 also excludes the diagonal)
    # np.bool is not available in every NumPy release, so the builtin bool is used here
    corr_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # Restructure correlation matrix to dataframe
    df = corr_matrix.stack().reset_index()
    df.columns = ['feature1', 'feature2', 'corr_coef']
    # Apply filter and sort coefficients
    df = df[df.corr_coef >= threshold].sort_values('corr_coef', ascending=False)
    return df
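To see the pairwise version in action without downloading a dataset, here is a minimal self-contained sketch. The function name correlated_pairs and the toy DataFrame are made up for illustration; the logic follows the same upper-triangle-then-stack idea as above, and numeric_only=True assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold, target_variable):
    """Return feature pairs whose absolute correlation is >= threshold."""
    features = df.drop(columns=target_variable)
    corr = features.corr(numeric_only=True).abs()
    # Boolean mask of the strict upper triangle (k=1 skips the diagonal),
    # so each pair appears once and no feature is paired with itself
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ['feature1', 'feature2', 'corr_coef']
    # The comparison also discards any NaN entries left over from the mask
    pairs = pairs[pairs['corr_coef'] >= threshold]
    return pairs.sort_values('corr_coef', ascending=False).reset_index(drop=True)


# Hypothetical toy data: "b" is exactly 2 * "a", "c" is only weakly related to both
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 1, 4, 2, 3],
    'target': [0, 1, 0, 1, 0],
})

result = correlated_pairs(toy, 0.9, 'target')
print(result)
```

With a threshold of 0.9 only the (a, b) pair should survive, since the absolute correlation of c with either of them is 0.3.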

Original answer:

You can easily create a Series with the coefficients above a threshold like this:

s = df.corr().loc[target_col]
s[s.abs() >= threshold]

where df is your dataframe, target_col your target column, and threshold, you know, the threshold.

Example:

import pandas as pd
import seaborn as sns

df = sns.load_dataset('iris')

print(df.shape)
# -> (150, 5)

print(df.head())

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

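def find_correlated_features(df, threshold, target_variable):
    # numeric_only=True skips the non-numeric 'species' column (required in recent pandas versions)
    s = df.corr(numeric_only=True).loc[target_variable].drop(target_variable)
    return s[s.abs() >= threshold]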

find_correlated_features(df, .7, 'sepal_length')

output:

petal_length    0.871754
petal_width     0.817941
Name: sepal_length, dtype: float64

You can apply .to_frame() followed by .T to the output to get a pandas DataFrame.
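A small sketch of that conversion, using a hand-built Series with the values from the output above as a stand-in for the function's return value. The 'feature' and 'corr_coef' column names in the last variant are my own choice, not something from the question.

```python
import pandas as pd

# Stand-in for the Series returned by find_correlated_features(df, .7, 'sepal_length')
s = pd.Series(
    {'petal_length': 0.871754, 'petal_width': 0.817941},
    name='sepal_length',
)

# One column named 'sepal_length', with the features as the index
as_column = s.to_frame()

# Transposed: one row labelled 'sepal_length', with the features as columns
as_row = s.to_frame().T

# Alternative long layout: one row per feature with explicit column names
tidy = s.rename_axis('feature').reset_index(name='corr_coef')

print(as_row)
print(tidy)
```

The transposed form is handy for stacking results for several targets with pd.concat, while the long layout is usually easier to sort and filter.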
