Reputation: 81
I am trying to solve the "House Prices" challenge from Kaggle and I'm stuck on my correlation matrix because it simply doesn't show all columns I want. Initially, it was obviously because of the large number of columns, so I did this:
import matplotlib.pyplot as plt
import seaborn as sns

# df_data is the training DataFrame loaded earlier
df = df_data[['SalePrice', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities']].copy()
corrmax = df.corr()
f, ax = plt.subplots(figsize=(16, 12))
sns.heatmap(corrmax, annot=True, ax=ax)
And then, the result is a heatmap with only SalePrice, MSSubClass, LotFrontage and LotArea for some reason. Can anyone please help me?
Upvotes: 4
Views: 8889
Reputation: 1315
If you analyze the House Prices dataset, you will see that many of its columns are categorical, for example 'MSZoning' and 'Alley'. corr() only computes correlations between numeric (non-categorical) columns, so the categorical ones are left out of the matrix:
corrmax = df.corr()
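To see which columns corr() can actually work with, it helps to inspect the dtypes and keep only the numeric columns explicitly. The sketch below uses a small made-up DataFrame as a stand-in for the Kaggle data (the values are illustrative, not taken from the real file), and selects numeric columns with select_dtypes so that it behaves the same across pandas versions.

```python
import pandas as pd

# Tiny made-up stand-in for the Kaggle data (values are illustrative only).
df = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
    "MSSubClass": [60, 20, 60, 70, 60],
    "MSZoning": ["RL", "RL", "RL", "RM", "RL"],
    "LotFrontage": [65.0, 80.0, 68.0, 60.0, 84.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "Street": ["Pave", "Pave", "Pave", "Grvl", "Pave"],
})

# String columns show up with dtype "object"; these are the categorical ones.
print(df.dtypes)

# Keep only the numeric columns, then compute the correlation matrix on them.
numeric_df = df.select_dtypes(include="number")
corrmax = numeric_df.corr()

# Only the numeric columns appear in the result.
print(corrmax.columns.tolist())
```

The columns that disappear from the heatmap are exactly the ones reported as object in df.dtypes.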
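If you want to measure the relationship between categorical and numeric variables, note that corr(method='spearman') is still limited to numeric columns, and a rank correlation is only meaningful when the categories are ordinal and encoded as numbers. For nominal categories such as 'MSZoning', use a measure designed for that case, such as the correlation ratio (eta) or a one-way ANOVA.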
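For a nominal column against a numeric one, one commonly used measure is the correlation ratio (eta), which compares the variance between category means to the total variance. The following is a minimal sketch, not a library function: the helper name correlation_ratio and the toy data are assumptions made for illustration.

```python
import pandas as pd


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Hypothetical helper: correlation ratio (eta) between a categorical
    and a numeric series. Returns a value between 0 and 1."""
    data = pd.DataFrame({"cat": categories, "val": values}).dropna()
    grand_mean = data["val"].mean()
    groups = data.groupby("cat")["val"]
    # Between-group sum of squares: group size times squared distance
    # of each group mean from the overall mean.
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    # Total sum of squares around the overall mean.
    ss_total = ((data["val"] - grand_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0
    return float((ss_between / ss_total) ** 0.5)


# Made-up example data, standing in for two columns of the real dataset.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", "RM", "RL", "RM"],
    "SalePrice": [208500, 181500, 140000, 129900, 250000, 118000],
})
eta = correlation_ratio(toy["MSZoning"], toy["SalePrice"])
print(f"eta(MSZoning, SalePrice) = {eta:.3f}")
```

A value near 0 means the category tells you little about the numeric column, and a value near 1 means the category means explain most of its variance. Unlike a correlation coefficient, eta has no sign.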
You will find some help from the links below...
An overview of correlation measures between categorical and continuous variables
Correlation between a nominal (IV) and a continuous (DV) variable
Upvotes: 4