Reputation: 179

Graphically show correlation between columns of a pandas dataframe

I have the following pandas dataframe covering more than 10k answers for 150 questions.

I am struggling to find a way to see the correlation between fields.

In particular I would like to understand how I can graphically show the correlation between Q015 and Q008, knowing that each respondent might have selected multiple answers (1,2,3).

So I am trying to figure out how to graphically display whether there is any correlation between Q015 and Q008 for each selected option of the survey.

Any ideas?

Upvotes: 0

Answers (1)

Samir Hinojosa

Reputation: 825

You can see a linear regression by Pearson

necessary libraries

import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Code

list_variables, list_COEF, list_MSE, list_RMSE, list_R2SCORE = ([] for i in range(5))
    
# initializing Linear Regression by Pearson
lr = LinearRegression()
xtrain, xtest, ytrain, ytest = train_test_split(df[["Q015"]], df[["Q008"]], test_size=0.3)
lr = LinearRegression()
lr_baseline = lr.fit(xtrain, ytrain)
pred_baseline = lr_baseline.predict(xtest)

list_variables.append("Q015 & Q008")
list_COEF.append(round(lr_baseline.coef_[0,0], 4))
list_MSE.append(round(mean_squared_error(ytest, pred_baseline), 2))
list_RMSE.append(round(math.sqrt(mean_squared_error(ytest, pred_baseline)), 2))
list_R2SCORE.append(round(r2_score(ytest, pred_baseline), 2))

# Plotting the graph
plt.figure(figsize=(12,8))
ax = plt.gca()

plt.suptitle("Q015 & Q008", fontsize=24, y=0.96)
plt.plot(xtest, ytest, 'bo', markersize = 5)
plt.plot(xtest, pred_baseline, color="red", linewidth = 2)
plt.xlabel("Q015", size=14)
plt.ylabel("Q008", size=14)
plt.tight_layout()
plt.show()

You will get something as follows where the column Coef. says to you how much the variables are correlated

Another way is to see the matrix correlation

df_corr = pd.DataFrame(df[["Q015", "Q008"]].corr()).round(2)
mask = np.zeros_like(df_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True 

plt.figure(figsize=(10,8))
plt.title("Pearson correlation between features", size=20)

ax = sns.heatmap(df_corr, mask=mask, vmin=-1, cmap="mako_r")

plt.xticks(rotation=25, size=14, horizontalalignment="right")
plt.yticks(rotation=0, size=14)
plt.tight_layout()
plt.show()

An example for numeric columns

df = pd.DataFrame(np.random.randint(0,15, size=(100, 6)), columns=[["Q01", "Q02", "Q03", "Q07", "Q015", "Q008"]])

Upvotes: 1

Graphically show correlation between columns of a pandas dataframe

Answers (1)

Related Questions