Reputation: 187
I am new to pandas/Python. I would like to know how the .corr function removes null data from a DataFrame with multiple variables when computing the correlation.
For example, let's suppose I have the following dataframe:
   'A1' 'A2' 'A3'
1     4    3    1
2     2    5   NA
3     3    2   NA
4    NA   10    2
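For reference, here is one way this example frame could be built (the exact dtypes are just my guess):

import numpy as np
import pandas as pd

# hypothetical construction of the example above
df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]},
                  index=[1, 2, 3, 4])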
1) Does it remove the entire row in which there is at least one NA/null value? (in this case, only the first row would be considered to compute the correlation matrix)
OR
2) Does it compute pairwise correlations, excluding only the individual missing values? (e.g. the correlation between 'A1' and 'A2' would use rows 1, 2 and 3, and the correlation between 'A2' and 'A3' would use rows 1 and 4.)
I haven't found this information in the .corr documentation; it only says that null values are excluded. Sorry if it is a silly question. I would be happy to learn where I can find this kind of detailed information about functions.
Upvotes: 13
Views: 20130
Reputation: 15
Let's say a DataFrame has 3 columns: A, B, C. In one row, data is present for columns A and C, but column B has a NaN value.
Now if we do df.dropna().corr(), that row is removed, and hence we have one less observation when computing the correlation of A and C.
If we instead do df.corr(), that particular row is not removed, and hence the correlation of A and C will be slightly different from the previous case.
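To make the difference concrete, here is a small sketch (the frame and its values are made up purely for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [2.0, 1.0, np.nan, 5.0],
                   'C': [1.0, 3.0, 2.0, 4.0]})

# listwise deletion: the row with NaN in B is dropped entirely,
# so the A-C correlation is computed from only three rows
print(df.dropna().corr().loc['A', 'C'])

# pairwise deletion (the default behaviour of .corr): that row is kept
# for the A-C pair, so all four rows are used and the result differs slightly
print(df.corr().loc['A', 'C'])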
Upvotes: -1
Reputation: 3752
Pandas excludes NaN values pairwise: for each pair of columns, the correlation is computed only over the observations where both values are present. We can verify that by removing those values ourselves and checking the results.
df
Out[8]:
    A1  A2   A3
0  4.0   3  1.0
1  2.0   5  NaN
2  3.0   2  NaN
3  NaN  10  2.0
With the following correlation results:
df.corr()
Out[9]:
          A1        A2   A3
A1  1.000000 -0.654654  NaN
A2 -0.654654  1.000000  1.0
A3       NaN  1.000000  1.0
Now if we remove the NaN from column A1, we can check that the result is the same:
df[pd.isnull(df['A1'])==False].corr()
Out[10]:
          A1        A2  A3
A1  1.000000 -0.654654 NaN
A2 -0.654654  1.000000 NaN
A3       NaN       NaN NaN
Similarly for A3:
df[pd.isnull(df['A3'])==False].corr()
     A1   A2   A3
A1  NaN  NaN  NaN
A2  NaN  1.0  1.0
A3  NaN  1.0  1.0
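Note that the A2/A3 entries come out as exactly 1.0 because only two overlapping observations survive, and two points always give a perfect correlation. If you want pandas to report NaN instead when too few pairs remain, .corr takes a min_periods argument, e.g.:

# require at least 3 overlapping non-NaN observations per pair of columns,
# otherwise the entry is reported as NaN
df.corr(min_periods=3)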
Edit
Just to complement the answer a bit, and referring back to this answer, you can see that pandas will ignore NaN values in the calculations, whereas numpy's np.corrcoef will not:
np.corrcoef(df.values)
Out[12]:
array([[ 1., nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])
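Note that np.corrcoef treats each row as a variable by default, so the 4x4 matrix above is comparing the rows of the frame. Passing rowvar=False compares the columns instead, and the NaN values still propagate into every pair they touch:

import numpy as np

# compare the columns (A1, A2, A3) instead of the rows; every entry involving
# a column that contains a NaN comes out as nan, and only the A2-A2 entry
# (the NaN-free column) is 1.0
np.corrcoef(df.values, rowvar=False)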
Upvotes: 16