Reputation: 926
The association between categorical variables should be computed using Cramér's V. I found the following code to plot it, but I don't understand why the author plotted it for "contribution", which is a numeric variable.
import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns


def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorical-categorical association.
    uses correction from Bergsma and Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  # total number of observations
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    # bias-corrected phi^2 and table dimensions
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))


cols = ["Party", "Vote", "contrib"]
corrM = np.zeros((len(cols),len(cols)))
# there's probably a nice pandas way to do this
for col1, col2 in itertools.combinations(cols, 2):
    idx1, idx2 = cols.index(col1), cols.index(col2)
    corrM[idx1, idx2] = cramers_corrected_stat(pd.crosstab(df[col1], df[col2]))
    corrM[idx2, idx1] = corrM[idx1, idx2]  # the matrix is symmetric

corr = pd.DataFrame(corrM, index=cols, columns=cols)
fig, ax = plt.subplots(figsize=(7, 6))
ax = sns.heatmap(corr, annot=True, ax=ax)
ax.set_title("Cramer V Correlation between Variables")
I also found Bokeh, but I am not sure whether it uses Cramér's V to plot the heatmap.
In my case, I have two categorical features: the first has 2 categories and the second has 37 categories. Could you please let me know how to plot a Cramér's V heatmap?
Some part of my dataset is here.
Thanks in advance.
Upvotes: 7
Views: 10051
Reputation: 4623
What's the problem? The code is absolutely right.
ax in this case is the heatmap of the correlation matrix between the variables.
Using "contribution" is not correct, but the article you took the code from says so itself:
"This isn't right to do on the Contribution variable, but we'll do more with a model later."
The author shows this variable as an example only. In your case, what is the reason to plot Cramér's V? You have just two variables (as far as I can see), so you will get only one Cramér's V coefficient.
But of course you can run the same code on your data and get a Cramér's V heatmap.
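With only two features, a single number is probably more useful than a heatmap. Here is a minimal sketch of that calculation, using the plain (uncorrected) Cramér's V rather than the bias-corrected version from your code. The DataFrame and the column names binary_feature and multi_feature are made up for illustration, since I can't see your actual data; replace them with your own columns.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Plain (uncorrected) Cramér's V between two categorical Series."""
    table = pd.crosstab(x, y)
    # correction=False disables the Yates correction, which SciPy would
    # otherwise apply to 2x2 tables.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))


# Hypothetical stand-in data: one feature with 2 levels, one with 37 levels.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "binary_feature": rng.choice(["yes", "no"], size=2000),
    "multi_feature": rng.choice([f"cat_{i}" for i in range(37)], size=2000),
})

v = cramers_v(df["binary_feature"], df["multi_feature"])
print(f"Cramér's V: {v:.3f}")
```

If you still want a heatmap, put both column names into cols in the original code; you will get a 2x2 matrix with the same coefficient in both off-diagonal cells (the diagonal stays 0 because that code never fills it).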
Upvotes: 2