Reputation: 187
I am new to pandas/Python. I would like to know how the .corr function removes null data from a DataFrame with multiple variables when computing the correlation.
For example, let's suppose I have the following dataframe:
   'A1' 'A2' 'A3'
1     4    3    1
2     2    5   NA
3     3    2   NA
4    NA   10    2
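For reference, here is one way this example frame could be built (the exact dtypes are just my guess):

import numpy as np
import pandas as pd

# hypothetical construction of the example above
df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]},
                  index=[1, 2, 3, 4])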
1) Does it remove the entire row in which there is at least one NA/null value? (in this case, only the first row would be considered to compute the correlation matrix)
OR
2) Does it compute pairwise correlations, excluding only the individual missing values? (e.g. the correlation between 'A1' and 'A2' would use rows 1, 2 and 3, and the correlation between 'A2' and 'A3' would use rows 1 and 4.)
I haven't found this information in the .corr documentation; it only says that null values are excluded. Sorry if it is a silly question. I would be happy to learn where I can find this kind of detailed information about functions.
Upvotes: 13
Views: 20130
Reputation: 15
Let's say a DataFrame has 3 columns: A, B, C. In one row, data is present for columns A and C, but column B has a NaN value.
Now if we do df.dropna().corr(), that row is removed, and hence we have one less observation when computing the correlation of A and C.
If we instead do df.corr(), that particular row is not removed, and hence the correlation of A and C will be slightly different from the previous case.
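To make the difference concrete, here is a small sketch (the frame and its values are made up purely for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [2.0, 1.0, np.nan, 5.0],
                   'C': [1.0, 3.0, 2.0, 4.0]})

# listwise deletion: the row with NaN in B is dropped entirely,
# so the A-C correlation is computed from only three rows
print(df.dropna().corr().loc['A', 'C'])

# pairwise deletion (the default behaviour of .corr): that row is kept
# for the A-C pair, so all four rows are used and the result differs slightly
print(df.corr().loc['A', 'C'])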
Upvotes: -1
Reputation: 3752
Pandas excludes NaN values pairwise: for each pair of columns, the correlation is computed only over the observations where both values are present. We can verify that by removing those values ourselves and checking the results.
df
Out[8]:
    A1  A2   A3
0  4.0   3  1.0
1  2.0   5  NaN
2  3.0   2  NaN
3  NaN  10  2.0
With the following correlation results:
df.corr()
Out[9]:
          A1        A2   A3
A1  1.000000 -0.654654  NaN
A2 -0.654654  1.000000  1.0
A3       NaN  1.000000  1.0
Now if we remove the NaN from column A1, we can check that the result is the same:
df[pd.isnull(df['A1'])==False].corr()
Out[10]:
          A1        A2  A3
A1  1.000000 -0.654654 NaN
A2 -0.654654  1.000000 NaN
A3       NaN       NaN NaN
Similarly for A3:
df[pd.isnull(df['A3'])==False].corr()
     A1   A2   A3
A1  NaN  NaN  NaN
A2  NaN  1.0  1.0
A3  NaN  1.0  1.0
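Note that the A2/A3 entries come out as exactly 1.0 because only two overlapping observations survive, and two points always give a perfect correlation. If you want pandas to report NaN instead when too few pairs remain, .corr takes a min_periods argument, e.g.:

# require at least 3 overlapping non-NaN observations per pair of columns,
# otherwise the entry is reported as NaN
df.corr(min_periods=3)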
Edit
Just to complement the answer a bit, and referring back to this answer, you can see that pandas will ignore NaN values in the calculations, whereas numpy's np.corrcoef will not:
np.corrcoef(df.values)
Out[12]:
array([[ 1., nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])
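Note that np.corrcoef treats each row as a variable by default, so the 4x4 matrix above is comparing the rows of the frame. Passing rowvar=False compares the columns instead, and the NaN values still propagate into every pair they touch:

import numpy as np

# compare the columns (A1, A2, A3) instead of the rows; every entry involving
# a column that contains a NaN comes out as nan, and only the A2-A2 entry
# (the NaN-free column) is 1.0
np.corrcoef(df.values, rowvar=False)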
Upvotes: 16