Reputation: 53
When I use pandas.DataFrame.corr() to create a correlation matrix, the resulting matrix (corr_matrix) has 37 columns, while the DataFrame (all_data) has 80 columns. I expected the two column counts to match; in other words, the correlation matrix should have shape (80, 80). But that is not what happens. I imputed all missing data before creating the correlation matrix. So why are the two column counts not equal?
The code
corr_matrix = all_data.corr(method="kendall").abs()
print("Missing value descending:\n{}\n".format(all_data.isnull().sum().sort_values(ascending=False)[:5]))
print("Original Dataframe shape: {}".format(all_data.shape))
print("Correlation Matrix shape: {}".format(corr_matrix.shape))
The output
Missing value descending:
MSSubClass 0
MSZoning 0
GarageYrBlt 0
GarageType 0
FireplaceQu 0
dtype: int64
Original Dataframe shape: (2904, 80)
Correlation Matrix shape: (37, 37)
Upvotes: 3
Views: 366
Reputation: 5433
Does the all_data DataFrame contain categorical columns?
Only correlations between numerical columns are computed; categorical columns are ignored. At least, that is what the following example shows:
import pandas as pd

train = pd.DataFrame({
    "cat1": list("ABC"),
    "cat2": list("xyz"),
    "num1": [1, 2, 3],
    "num2": [-2, 10, -5]
})
# 2 numerical and 2 categorical columns
>>> train
  cat1 cat2  num1  num2
0    A    x     1    -2
1    B    y     2    10
2    C    z     3    -5
# only numerical columns are present
>>> train.corr(method="kendall").abs()
          num1      num2
num1  1.000000  0.333333
num2  0.333333  1.000000
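To check this on your own data, you can count the numeric columns directly and correlate only that subset. The following is a minimal sketch, not your actual dataset: the small all_data frame is a made-up stand-in, and the LotArea and YearBuilt columns and all values are hypothetical. Selecting the numeric columns explicitly also avoids a version difference: since pandas 2.0, corr() defaults to numeric_only=False and raises an error on text columns instead of dropping them silently.

```python
import pandas as pd

# Hypothetical stand-in for all_data: two text columns, two numeric columns.
# Column names LotArea/YearBuilt and all values are made up for illustration.
all_data = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "GarageType": ["Attchd", "Detchd", "Attchd", "BuiltIn"],
    "LotArea": [8450, 9600, 11250, 9550],
    "YearBuilt": [2003, 1976, 2001, 1915],
})

# Split the columns by dtype to see which ones corr() can actually use.
numeric_cols = all_data.select_dtypes(include="number").columns
other_cols = all_data.columns.difference(numeric_cols)

print("numeric columns:", list(numeric_cols))
print("non-numeric columns:", list(other_cols))

# Correlate only the numeric subset; this behaves the same on old and new pandas.
# Pass method="kendall" as in the question if SciPy is installed.
corr_matrix = all_data[numeric_cols].corr().abs()
print("Correlation Matrix shape: {}".format(corr_matrix.shape))
```

On your real all_data, len(numeric_cols) should come out as 37 if non-numeric columns are the cause. If you want the categorical columns in the matrix as well, they would first have to be encoded as numbers, and whether a rank correlation on such codes is meaningful depends on the column.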
Upvotes: 1