Why are pandas and NumPy producing different results for pairwise correlations with NaN?

Question

I am trying to create a table of pairwise correlations for a model that I am building, and I have some numpy.nan values (NaN) in my dataset. For some reason, np.corrcoef() gives me different results than the pandas DataFrame.corr() method.

For instance:

import numpy as np
import pandas as pd

dataset = np.array([[1, np.nan, np.nan, 1, 1], [1, np.nan, np.nan, 3000, 1]])
pandas_data = pd.DataFrame(dataset.transpose())

print(np.corrcoef(dataset))

which gives me:

[[ nan  nan]
[ nan  nan]]

but with the pandas DataFrame I do get one result:

print(pandas_data.corr())

    0   1
0 NaN NaN
1 NaN   1

Is there a fundamental difference in the way they handle NaN, or did I miss something? (Also, why is my correlation 1 if I do have different values?) Thanks

user6655984 · Accepted Answer

NumPy's default behavior is to propagate NaNs. That is, it performs the computations with the entire array, and every time something is added to NaN (or multiplied by it, etc.), the result is NaN. This is reasonable: if a = 5 and b = NaN, a + b should be NaN. Consequently, the variance of an array containing at least one NaN is NaN, and so is the correlation of that array with any other array.
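The propagation described above can be checked directly. This is a minimal sketch; the arrays a and b and the result names are made up for illustration and are not part of the question's data.

```python
import numpy as np

# Illustrative arrays (not from the question): b contains a single NaN
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, np.nan, 3.0, 4.0])

total = 5 + np.nan        # arithmetic with NaN yields NaN
var_b = np.var(b)         # one NaN is enough to make the variance NaN
r = np.corrcoef(a, b)     # every entry that involves b is NaN

print(total)
print(var_b)
print(r)
```

Only the entry for a with itself stays finite here, because a has no missing values. In the question both rows contain NaN, which is why the whole matrix comes back as NaN.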

The raw-data-oriented nature of pandas leads to different design decisions: it tries to extract as much information as possible from incomplete data. In particular, the corr method is designed (and documented) to exclude NaN.
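To see what "excluding NaN" means in practice, here is a rough sketch with a made-up three-column frame (the column names a, b, c and the helper pairwise_r are hypothetical, chosen only for this illustration). It suggests that corr works pair by pair: for each pair of columns it uses the rows where both values are present, regardless of gaps in other columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only column "c" has a missing value (row 1)
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0],
    "c": [1.0, np.nan, 2.0, 5.0, 3.0],
})

corr = df.corr()

def pairwise_r(frame, x, y):
    """Correlation of columns x and y using only rows where both are present."""
    pair = frame[[x, y]].dropna()
    return np.corrcoef(pair[x], pair[y])[0, 1]

manual_ab = pairwise_r(df, "a", "b")  # uses all 5 rows
manual_ac = pairwise_r(df, "a", "c")  # uses the 4 rows where "c" is present

print(corr)
print(manual_ab, manual_ac)
```

The a/b entry still uses row 1 even though c is missing there, so the pandas matrix can mix different sample sizes across its entries.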

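To reproduce the pandas behavior in NumPy, build a boolean mask, valid, as below: it keeps only the columns of dataset (the observations) that contain no NaN in any row. With only two variables, as here, this gives the same result as the pairwise exclusion that pandas performs.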

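import numpy as np
import pandas as pd

dataset = np.array([[1, 2, 3, 4, np.nan], [1, 0, np.nan, 8, 9]])

# Keep only the columns (observations) where no row has a NaN
valid = ~np.isnan(dataset).any(axis=0)
numpy_corr = np.corrcoef(dataset[:, valid])

# pandas expects variables as columns, hence the transpose
pandas_data = pd.DataFrame(dataset.transpose())
pandas_corr = pandas_data.corr()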

Both correlation methods now return the same result:

array([[ 1.        ,  0.90112711],
       [ 0.90112711,  1.        ]])

The diagonal entries represent the correlation of an array with itself, which is always 1 (theoretically; in practice it's 1 within machine precision).
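This also accounts for the output in the question. The 1 is the diagonal entry for column 1, i.e. its correlation with itself. The NaN entries appear to come from column 0 being constant once the NaN rows are dropped: its standard deviation is zero, so any correlation involving it is undefined. A small sketch using the question's data (the names clean, std_per_column and corr are introduced here for illustration):

```python
import numpy as np
import pandas as pd

# Data from the question
dataset = np.array([[1, np.nan, np.nan, 1, 1],
                    [1, np.nan, np.nan, 3000, 1]])
pandas_data = pd.DataFrame(dataset.transpose())

# Rows that survive NaN exclusion: column 0 is [1, 1, 1], column 1 is [1, 3000, 1]
clean = pandas_data.dropna()
std_per_column = clean.std()

corr = pandas_data.corr()

print(std_per_column)  # column 0 has zero spread
print(corr)            # entries involving column 0 are NaN; the (1, 1) entry is 1
```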
