Reputation: 46
I am new to data science and trying to get a grip on exploratory data analysis. My goal is to get a correlation matrix between all the variables. For numerical variables I use Pearson's R, for categorical variables I use the corrected Cramer's V. The issue now is to get a meaningful correlation between categorical and numerical variables. For that I use the correlation ratio, as outlined here. The issue with that is that categorical variables with high cardinality show a high correlation no matter what:
[Image: correlation matrix cat vs. num]
This seems nonsensical, since the result would essentially reflect the cardinality of the categorical variable rather than its correlation with the numerical variable. The question is: how can I deal with this issue in order to get a meaningful correlation?
The Python code below shows how I implemented the correlation ratio:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
train = pd.DataFrame({
    'id': [0,1,2,3,4,5,6,7,8,9,10,11], 'num3': [6,3,3,9,6,9,9,3,6,3,6,9],
    'cat2': [0,1,0,1,0,1,0,1,0,1,0,1], 'cat3': [0,1,2,0,1,2,0,1,2,0,1,2],
    'cat6': [0,4,8,2,6,10,0,4,8,2,6,10], 'cat12': [0,7,2,9,4,11,6,1,8,3,10,5],
})
cat_cols, num_cols = ['cat2','cat3','cat6','cat12'], ['id','num3']
def corr_ratio(cats, nums):
    avgtotal = nums.mean()
    cu = cats.unique()
    # one slot per distinct category (the original allocated one per row, which was harmless but wasteful)
    elements_avg, elements_count = np.zeros(cu.size), np.zeros(cu.size)
    for i in range(cu.size):
        cn = cu[i]
        filt = cats == cn
        elements_count[i] = filt.sum()
        elements_avg[i] = nums[filt].mean(axis=0)
    # between-group sum of squares, weighted by group size
    numerator = np.sum(np.multiply(elements_count, np.power(np.subtract(elements_avg, avgtotal), 2)))
    denominator = np.sum(np.power(np.subtract(nums, avgtotal), 2))  # total sum of squares
    return 0.0 if numerator == 0 else np.sqrt(numerator / denominator)
rows = []
for cat in cat_cols:
    col = []
    for num in num_cols:
        col.append(round(corr_ratio(train[cat], train[num]), 2))
    rows.append(col)
df = pd.DataFrame(np.array(rows), columns=num_cols, index=cat_cols)
sns.heatmap(df)
plt.tight_layout()
plt.show()
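To make the effect concrete, here is a minimal sketch that recomputes the correlation ratio for the sample data against num3 using a groupby. The helper name corr_ratio_groupby is made up for this illustration and is not part of the code above. It shows that cat12, where every row is its own category, reaches exactly 1: each group mean equals the single value in that group, so the between-group sum of squares equals the total sum of squares.

```python
import numpy as np
import pandas as pd

def corr_ratio_groupby(cats: pd.Series, nums: pd.Series) -> float:
    """Correlation ratio (eta); hypothetical helper used only for this illustration."""
    grand_mean = nums.mean()
    groups = nums.groupby(cats)
    # between-group sum of squares: group size times squared deviation of the group mean
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((nums - grand_mean) ** 2).sum()
    return 0.0 if ss_total == 0 else float(np.sqrt(ss_between / ss_total))

# same sample data as in the question
train = pd.DataFrame({
    'num3': [6, 3, 3, 9, 6, 9, 9, 3, 6, 3, 6, 9],
    'cat2': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    'cat3': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
    'cat6': [0, 4, 8, 2, 6, 10, 0, 4, 8, 2, 6, 10],
    'cat12': [0, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5],
})

eta = {c: corr_ratio_groupby(train[c], train['num3'])
       for c in ['cat2', 'cat3', 'cat6', 'cat12']}
for name, value in eta.items():
    print(f"{name}: eta = {value:.3f}, eta^2 = {value ** 2:.4f}")
```

On this data eta rises with the number of categories, and with one observation per category it cannot be anything other than 1, regardless of how the numerical values are arranged.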
Upvotes: 2
Views: 812
Reputation: 26
I think it could be because your seaborn plot is showing something closer to the raw chi-squared values. Cramér's V is derived from chi-squared but is not equivalent to it, so a specific cell can have a high chi-squared value while Cramér's V gives a more relevant figure. I'm not even sure it makes sense to compare raw values across modalities, because they can be on totally different orders of magnitude.
[Images: chi-squared formula; Cramér's V formula]
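For reference, the standard textbook definitions of the two quantities can be written as follows. The symbols are chosen for this note and are not taken from the answer: O_ij and E_ij are the observed and expected counts in cell (i, j) of an r-by-c contingency table, and n is the total number of observations.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\; c - 1)}}
```

Chi-squared grows with the sample size and the table dimensions, while V divides those out and stays within [0, 1], which is why the two are not directly comparable.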
Upvotes: 1
Reputation: 142
If I am not mistaken, there is another method called Theil’s U. How about trying it out to see whether the same problem occurs?
You can use this:
from dython.nominal import associations  # assumes the dython package is installed
num_cols = your_df.select_dtypes(include=['number']).columns.to_list()
cat_target_cols = your_df.select_dtypes(include=['object']).columns.to_list()
corr_df = pd.DataFrame(associations(dataset=your_df, numerical_columns=num_cols, nom_nom_assoc='theil', figsize=(20, 20), nominal_columns=cat_target_cols).get('corr'))
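To see what Theil's U measures without relying on a library, here is a minimal pure-Python sketch of the usual definition, U(X|Y) = (H(X) - H(X|Y)) / H(X). The function names are made up for this illustration, and the data are the cat2 and cat12 columns from the question. Note that Theil's U relates two categorical variables, so it would replace Cramér's V rather than the correlation ratio.

```python
import math
from collections import Counter

def conditional_entropy(x, y):
    """H(X|Y) in nats, estimated from paired observations."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    # H(X|Y) = -sum over (x, y) of p(x, y) * log(p(x, y) / p(y))
    return -sum((c / n) * math.log(c / y_counts[yv])
                for (_, yv), c in xy_counts.items())

def theils_u(x, y):
    """Theil's U (uncertainty coefficient) of X given Y; asymmetric by design."""
    n = len(x)
    h_x = -sum((c / n) * math.log(c / n) for c in Counter(x).values())
    if h_x == 0:
        return 1.0  # X is constant, so nothing is left to explain
    return (h_x - conditional_entropy(x, y)) / h_x

cat2 = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
cat12 = [0, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5]

u_cat2_given_cat12 = theils_u(cat2, cat12)
u_cat12_given_cat2 = theils_u(cat12, cat2)
print(u_cat2_given_cat12, u_cat12_given_cat2)
```

On this sample, knowing cat12 determines cat2 completely, so U(cat2|cat12) is 1, while the reverse direction is much lower. This suggests that a column with one distinct value per row will still look like a perfect predictor under Theil's U, so it may be worth checking high-cardinality columns separately.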
Upvotes: 0