Stefano

Reputation: 179

Graphically show correlation between columns of a pandas dataframe

I have the following pandas dataframe covering more than 10k answers for 150 questions.

Pandas Dataframe

I am struggling to find a way to see the correlation between fields.

In particular I would like to understand how I can graphically show the correlation between Q015 and Q008, knowing that each respondent might have selected multiple answers (1,2,3).

So I am trying to figure out how to graphically display whether there is any correlation between Q015 and Q008 for each selected option of the survey.

Any ideas?

Upvotes: 0

Views: 556

Answers (1)

Samir Hinojosa

Reputation: 825

You can fit a linear regression between the two columns and plot it

Necessary libraries

import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Code

list_variables, list_COEF, list_MSE, list_RMSE, list_R2SCORE = ([] for i in range(5))

# Fit a linear regression of Q008 on Q015, holding out 30% of the rows for testing
xtrain, xtest, ytrain, ytest = train_test_split(df[["Q015"]], df[["Q008"]], test_size=0.3)
lr = LinearRegression()
lr_baseline = lr.fit(xtrain, ytrain)
pred_baseline = lr_baseline.predict(xtest)

list_variables.append("Q015 & Q008")
list_COEF.append(round(lr_baseline.coef_[0,0], 4))
list_MSE.append(round(mean_squared_error(ytest, pred_baseline), 2))
list_RMSE.append(round(math.sqrt(mean_squared_error(ytest, pred_baseline)), 2))
list_R2SCORE.append(round(r2_score(ytest, pred_baseline), 2))

# Plotting the graph
plt.figure(figsize=(12,8))
ax = plt.gca()

plt.suptitle("Q015 & Q008", fontsize=24, y=0.96)
plt.plot(xtest, ytest, 'bo', markersize = 5)
plt.plot(xtest, pred_baseline, color="red", linewidth = 2)
plt.xlabel("Q015", size=14)
plt.ylabel("Q008", size=14)
plt.tight_layout()
plt.show()

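You will get a scatter plot of the test points with the fitted line in red. The Coef. value is the slope of that line: its sign tells you the direction of the relationship, and the R2 score tells you how well a straight line describes it. The slope is not the Pearson correlation itself.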
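The lists filled in above can be collected into a one-row summary table. This is only a sketch: it runs on synthetic integers standing in for the survey data, and apart from "Coef." the column names are my own choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (assumption: numeric answers).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 15, size=(100, 2)), columns=["Q015", "Q008"])

xtrain, xtest, ytrain, ytest = train_test_split(
    df[["Q015"]], df[["Q008"]], test_size=0.3, random_state=0
)
model = LinearRegression().fit(xtrain, ytrain)
pred = model.predict(xtest)
mse = mean_squared_error(ytest, pred)

# One row per pair of variables; more pairs could be appended the same way.
summary = pd.DataFrame(
    {
        "Variables": ["Q015 & Q008"],
        "Coef.": [round(float(model.coef_[0, 0]), 4)],
        "MSE": [round(float(mse), 2)],
        "RMSE": [round(float(np.sqrt(mse)), 2)],
        "R2 Score": [round(float(r2_score(ytest, pred)), 2)],
    }
)
print(summary)
```

With random data like this the slope should be close to zero and the R2 score low, which is what "no linear relationship" looks like in the table.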

Another way is to look at the correlation matrix

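# .corr() already returns a DataFrame and uses Pearson by default
df_corr = df[["Q015", "Q008"]].corr().round(2)

# Hide the diagonal and the upper triangle, which only repeat information
mask = np.zeros_like(df_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=(10,8))
plt.title("Pearson correlation between features", size=20)

# Fix the colour scale to [-1, 1] and print the value inside each cell
ax = sns.heatmap(df_corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap="mako_r")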

plt.xticks(rotation=25, size=14, horizontalalignment="right")
plt.yticks(rotation=0, size=14)
plt.tight_layout()
plt.show()
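With only two columns the masked heatmap shows a single cell, so it may be simpler to read the number directly. A minimal sketch, again on synthetic data rather than the real survey:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the survey data (assumption: numeric answers).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 15, size=(100, 2)), columns=["Q015", "Q008"])

# Pearson r between the two columns (Series.corr defaults to method="pearson").
r = df["Q015"].corr(df["Q008"])

# The same value is the off-diagonal entry of the correlation matrix.
r_from_matrix = df[["Q015", "Q008"]].corr().loc["Q015", "Q008"]

print(f"Pearson r = {r:.2f}")
```

The heatmap becomes more useful once more question columns are passed to corr(), since it then shows every pair at once.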

An example for numeric columns

df = pd.DataFrame(np.random.randint(0, 15, size=(100, 6)), columns=["Q01", "Q02", "Q03", "Q07", "Q015", "Q008"])
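The question mentions that a respondent may select several options (1, 2, 3) per question, which a single numeric column does not capture. One possible approach, assuming each cell is a comma-separated string such as "1,2" (the question does not show the actual format, so this is a guess), is to turn every option into a 0/1 indicator column and correlate the indicators. The sample rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical raw answers: each cell lists the selected options.
raw = pd.DataFrame(
    {
        "Q015": ["1,2", "3", "1", "2,3", "1,2,3", "2"],
        "Q008": ["1", "3", "1,2", "3", "1,3", "2"],
    }
)

# One 0/1 indicator column per (question, option) pair.
q015 = raw["Q015"].str.get_dummies(sep=",").add_prefix("Q015_")
q008 = raw["Q008"].str.get_dummies(sep=",").add_prefix("Q008_")

# Correlate all indicators, then keep only the Q015-vs-Q008 block.
cross_corr = pd.concat([q015, q008], axis=1).corr().loc[q015.columns, q008.columns]
print(cross_corr.round(2))
```

The resulting 3x3 table can be passed to sns.heatmap in the same way as the correlation matrix above, giving one cell per pair of options. An option that is selected by everyone or by no one has zero variance, so its correlations come out as NaN.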

Upvotes: 1
