Taq Seorangpun

Reputation: 46

Numerical vs. categorical vars: Why 100% correlation for categorical variable with high cardinality?

I am new to data science and trying to get a grip on exploratory data analysis. My goal is a correlation matrix between all the variables. For numerical variables I use Pearson's r; for categorical variables I use the bias-corrected Cramér's V. The remaining issue is getting a meaningful correlation between categorical and numerical variables. For that I use the correlation ratio, as outlined here. The problem is that categorical variables with high cardinality show a high correlation no matter what:

[Figure: heatmap of the correlation ratio, categorical vs. numerical variables]

This seems nonsensical, since it would effectively measure the cardinality of the categorical variable rather than its correlation with the numerical variable. The question is: how do I deal with this in order to get a meaningful correlation?

The Python code below shows how I implemented the correlation ratio:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.DataFrame({
    'id': [0,1,2,3,4,5,6,7,8,9,10,11], 'num3': [6,3,3,9,6,9,9,3,6,3,6,9],
    'cat2': [0,1,0,1,0,1,0,1,0,1,0,1], 'cat3': [0,1,2,0,1,2,0,1,2,0,1,2],
    'cat6': [0,4,8,2,6,10,0,4,8,2,6,10], 'cat12': [0,7,2,9,4,11,6,1,8,3,10,5],
})
cat_cols, num_cols = ['cat2','cat3','cat6','cat12'], ['id','num3']

def corr_ratio(cats, nums):
    # Correlation ratio (eta): sqrt(between-group SS / total SS).
    avgtotal = nums.mean()
    cu = cats.unique()
    elements_avg, elements_count = np.zeros(cu.size), np.zeros(cu.size)
    for i, cn in enumerate(cu):
        filt = cats == cn
        elements_count[i] = filt.sum()       # group size
        elements_avg[i] = nums[filt].mean()  # group mean
    # Between-group sum of squares.
    numerator = np.sum(elements_count * (elements_avg - avgtotal) ** 2)
    # Total sum of squares (total variance times n).
    denominator = np.sum((nums - avgtotal) ** 2)
    return 0.0 if denominator == 0 else np.sqrt(numerator / denominator)

rows = []
for cat in cat_cols:
    row = [round(corr_ratio(train[cat], train[num]), 2) for num in num_cols]
    rows.append(row)

df = pd.DataFrame(np.array(rows), columns=num_cols, index=cat_cols)
sns.heatmap(df)
plt.tight_layout()
plt.show()
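To illustrate the failure mode: in the degenerate case where every category contains exactly one row, each group mean equals its own value, so the between-group sum of squares equals the total sum of squares and the correlation ratio is 1 no matter which numerical column is used. A minimal self-contained sketch (the groupby-free helper and random data are illustrative, not part of the original code):

```python
import numpy as np
import pandas as pd

def corr_ratio(cats, nums):
    # Correlation ratio (eta): sqrt(between-group SS / total SS).
    avgtotal = nums.mean()
    between = sum(
        (cats == c).sum() * (nums[cats == c].mean() - avgtotal) ** 2
        for c in cats.unique()
    )
    total = ((nums - avgtotal) ** 2).sum()
    return 0.0 if total == 0 else np.sqrt(between / total)

rng = np.random.default_rng(0)
nums = pd.Series(rng.normal(size=12))  # arbitrary numerical column
cats = pd.Series(range(12))            # one distinct label per row

# Every group is a single row, so each group mean equals the value itself
# and the between-group SS equals the total SS: eta is exactly 1.
print(round(corr_ratio(cats, nums), 6))  # -> 1.0
```

This is why the heatmap rows for high-cardinality columns look saturated: the statistic is measuring group count, not association.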

Upvotes: 2

Views: 812

Answers (2)

Pierrick Leroy

Reputation: 26

I think it could be because what you are visualising in your seaborn plot is closer to chi-squared than to Cramér's V. Cramér's V is derived from chi-squared but is not equivalent to it, so a cell can have a high chi-squared value while the corresponding Cramér's V is more moderate. I am also not sure it makes sense to compare raw modality values directly, because they can be on totally different orders of magnitude.

Chi-squared: χ² = Σᵢⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ, where O and E are the observed and expected counts.

Cramér's V: V = √(χ² / (n · min(k − 1, r − 1))), where n is the sample size and k, r are the numbers of columns and rows in the contingency table.
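For concreteness, here is how a bias-corrected Cramér's V could be computed from a contingency table using scipy; this is a sketch (the `cramers_v` helper and the toy series are my own, not from the question):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # Chi-squared from the contingency table (no Yates continuity correction).
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    phi2 = chi2 / n
    # Bias correction, giving the "corrected" Cramer's V from the question.
    phi2corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    denom = min(kcorr - 1, rcorr - 1)
    return 0.0 if denom == 0 else np.sqrt(phi2corr / denom)

x = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'])
y = pd.Series(['c', 'd', 'c', 'd', 'c', 'd', 'c', 'd'])
print(round(cramers_v(x, x), 6))  # identical columns -> 1.0
print(round(cramers_v(x, y), 6))  # unrelated columns -> 0.0
```

Note this is for the categorical-vs-categorical cells of the matrix; it does not address the correlation-ratio inflation for categorical-vs-numerical pairs.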

Upvotes: 1

Ming Jun Lim

Reputation: 142

If I am not mistaken, there is another method called Theil's U. How about trying it to see whether the same problem occurs?

You can use dython's associations for this:

from dython.nominal import associations

num_cols = your_df.select_dtypes(include=['number']).columns.to_list()
cat_target_cols = your_df.select_dtypes(include=['object']).columns.to_list()

corr_df = pd.DataFrame(
    associations(dataset=your_df, numerical_columns=num_cols,
                 nom_nom_assoc='theil', figsize=(20, 20),
                 nominal_columns=cat_target_cols).get('corr')
)
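If you would rather avoid the extra dependency, Theil's U (the uncertainty coefficient) is also straightforward to compute from scratch. A minimal sketch, assuming the usual definition U(X|Y) = (H(X) − H(X|Y)) / H(X); the helper names are mine:

```python
import math
from collections import Counter

def entropy(values):
    # Shannon entropy of a discrete sample, in nats.
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())

def theils_u(x, y):
    # U(x|y) = (H(x) - H(x|y)) / H(x): how much knowing y reduces
    # uncertainty about x. Asymmetric, bounded in [0, 1].
    hx = entropy(x)
    if hx == 0:
        return 1.0
    n = len(x)
    groups = {}
    for xv, yv in zip(x, y):
        groups.setdefault(yv, []).append(xv)
    hx_given_y = sum(len(g) / n * entropy(g) for g in groups.values())
    return (hx - hx_given_y) / hx

x = ['a', 'b', 'a', 'b']
print(theils_u(x, x))          # y determines x completely -> 1.0
print(theils_u(x, ['c'] * 4))  # constant y tells us nothing -> 0.0
```

Unlike the correlation ratio, this only covers categorical-vs-categorical pairs, but its asymmetry (U(x|y) ≠ U(y|x)) can make the high-cardinality direction visible instead of hiding it.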

Upvotes: 0
