Rafaela V

Reputation: 187

How does .corr remove NA and null values?

I am new to pandas/Python. I would like to know how the .corr function handles null data in a dataframe with multiple variables when computing the correlation.

For example, let's suppose I have the following dataframe:

  #  'A1'  'A2' 'A3'
  1   4     3    1
  2   2     5    NA
  3   3     2    NA
  4   NA    10   2

1) Does it remove the entire row in which there is at least one NA/null value? (in this case, only the first row would be considered to compute the correlation matrix)

OR

2) Does it compute pairwise correlation, excluding only individual values? (e.g. for the correlation between 'A1' and 'A2', it uses rows 1, 2 and 3; and for the correlation between 'A1' and 'A3', it uses rows 1 and 4.)

I haven't found such information in the function .corr documentation. It only says it removes the null values. Sorry if it is a silly question. I would be happy to learn where I can find this kind of detailed information regarding functions.

Upvotes: 13

Views: 20130

Answers (2)

Rachit

Reputation: 15

Let's say there are 3 columns in a DataFrame: A, B and C. In one row, data is present for columns A and C, but column B has a NaN value.

If we do df.dropna().corr(), then that row is removed entirely, so we have one less observation when computing the correlation of A and C.

If instead we do df.corr(), that row is not removed for the A and C pair, so the correlation of A and C can differ from the previous case.
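To make the difference concrete, here is a minimal sketch using the data from the question. The variable names `listwise` and `pairwise` are just labels chosen for this illustration. With this particular data only the first row survives `dropna()`, so every correlation in the listwise result should come out undefined, while `df.corr()` still has enough rows for the A1/A2 and A2/A3 pairs.

```python
import numpy as np
import pandas as pd

# Data from the question; NA is represented as NaN.
df = pd.DataFrame({
    "A1": [4, 2, 3, np.nan],
    "A2": [3, 5, 2, 10],
    "A3": [1, np.nan, np.nan, 2],
})

# Listwise deletion: drop every row that contains at least one NaN first.
listwise = df.dropna().corr()

# Default behaviour of .corr(): NaN handling is done per pair of columns.
pairwise = df.corr()

print(listwise)
print(pairwise)
```

If you actually want behaviour 1) from the question, calling `dropna()` before `corr()` is the explicit way to ask for it.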

Upvotes: -1

rrcal

Reputation: 3752

Pandas excludes NaN values pairwise: for each pair of columns, it drops only the rows where either of the two values is NaN (option 2 in the question). We can verify that by removing those rows ourselves and checking the results.

df

Out[8]: 
    A1  A2   A3
0  4.0   3  1.0
1  2.0   5  NaN
2  3.0   2  NaN
3  NaN  10  2.0

With the following correlation results:

df.corr()

Out[9]: 
          A1        A2   A3
A1  1.000000 -0.654654  NaN
A2 -0.654654  1.000000  1.0
A3       NaN  1.000000  1.0

Now if we remove the row where A1 is NaN, we can check that the A1/A2 correlation is the same:

df[df['A1'].notna()].corr()

Out[10]: 
          A1        A2  A3
A1  1.000000 -0.654654 NaN
A2 -0.654654  1.000000 NaN
A3       NaN       NaN NaN

Similarly for A3:

df[df['A3'].notna()].corr()

    A1   A2   A3
A1 NaN  NaN  NaN
A2 NaN  1.0  1.0
A3 NaN  1.0  1.0
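The same check can be done for every pair at once. The sketch below is one way to do it, not something taken from the pandas source: for each pair of columns it drops the rows where either value is missing, computes the correlation on what is left, and compares it to the matching entry of `df.corr()`. The names `full`, `manual` and `strict` are made up for this example. The last line uses the `min_periods` parameter of `DataFrame.corr`, which sets how many complete observations a pair needs before a result is reported.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A1": [4, 2, 3, np.nan],
    "A2": [3, 5, 2, 10],
    "A3": [1, np.nan, np.nan, 2],
})

full = df.corr()
manual = {}

for a, b in combinations(df.columns, 2):
    # Keep only the rows where both columns of this pair are present.
    pair = df[[a, b]].dropna()
    # A correlation needs at least two observations; otherwise it is undefined.
    manual[(a, b)] = pair[a].corr(pair[b]) if len(pair) >= 2 else np.nan
    print(a, b, "rows used:", len(pair), "manual:", manual[(a, b)], "corr():", full.loc[a, b])

# Require at least 3 complete observations per pair.
strict = df.corr(min_periods=3)
print(strict)
```

With `min_periods=3`, pairs that share fewer than three complete rows (here A2/A3, which share only two) are reported as NaN instead of a correlation based on very little data.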

Edit

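To complement the answer, and referring back to this answer: pandas ignores NaN values in the calculation, whereas NumPy's np.corrcoef does not, so any NaN in the input propagates to the result. (np.corrcoef treats each row as a variable by default, which is why the output below is 4x4.)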

np.corrcoef(df.values)

Out[12]: 
array([[ 1., nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])
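If you need the pandas-style behaviour with plain NumPy, one possible approach is to mask out the incomplete rows for each pair before calling `np.corrcoef`. This is only a sketch: `pairwise_corr` is a hypothetical helper written for this example, not a NumPy function. It also passes `rowvar=False` to the naive call so that columns, rather than rows, are treated as variables.

```python
import numpy as np

# Same data as the question: columns are A1, A2, A3.
data = np.array([
    [4.0, 3.0, 1.0],
    [2.0, 5.0, np.nan],
    [3.0, 2.0, np.nan],
    [np.nan, 10.0, 2.0],
])

# Naive call: NaN propagates into every entry that involves A1 or A3.
naive = np.corrcoef(data, rowvar=False)


def pairwise_corr(x, y):
    """Correlation of x and y using only positions where both are present.

    Hypothetical helper for illustration; returns NaN when fewer than
    two complete observations remain.
    """
    mask = ~np.isnan(x) & ~np.isnan(y)
    if mask.sum() < 2:
        return np.nan
    return np.corrcoef(x[mask], y[mask])[0, 1]


r_a1_a2 = pairwise_corr(data[:, 0], data[:, 1])
r_a2_a3 = pairwise_corr(data[:, 1], data[:, 2])
r_a1_a3 = pairwise_corr(data[:, 0], data[:, 2])

print(naive)
print(r_a1_a2, r_a2_a3, r_a1_a3)
```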

Upvotes: 16
