Kim O

Reputation: 621

Python: Why would numpy.corrcoef() return NaN values?

I am working with high-dimensional data, so it is infeasible to inspect every value by hand.

# Import
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Delete all zero columns
df = df.loc[:, (df != 0).any(axis=0)]

# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)

# Standardise
X_std = StandardScaler().fit_transform(df.values)
print(X_std.dtype) # Returns "float64"

# Correlation
cor_mat1 = np.corrcoef(X_std.T)
cor_mat1.max() # Returns nan

And then cor_mat1.max() returns nan.

When computing cor_mat1 = np.corrcoef(X_std.T) I get this warning:

/Users/kimrants/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py:3183: RuntimeWarning: invalid value encountered in true_divide

This is a snippet of the input dataframe: [screenshot of the dataframe]

To try to fix it myself, I started by removing all zero columns and all columns that contained any NaN values. I thought this would solve the problem, but it didn't. Am I missing something? I don't see why else it would return NaN values.
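
To narrow it down without checking every value by hand, the best I can think of is to locate the offending variables directly (a sketch):

import numpy as np

# Sketch: flag the variables whose correlation row contains NaN, then
# inspect only those columns of df (df's columns line up with the
# rows/columns of cor_mat1, since corrcoef was called on X_std.T).
bad = np.isnan(cor_mat1).any(axis=0)
print(df.columns[bad])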

My end goal is to compute eigen-values and -vectors.
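
For context, the step I am working towards looks roughly like this (a sketch, assuming cor_mat1 is NaN-free):

import numpy as np

# eigh handles symmetric matrices such as a correlation matrix; it
# returns eigenvalues in ascending order, eigenvectors as columns.
eig_vals, eig_vecs = np.linalg.eigh(cor_mat1)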

Upvotes: 2

Views: 9728

Answers (2)

Nic Moetsch

Reputation: 926

If every row of a column holds the same value, that column's variance is 0. np.corrcoef() therefore divides that column's correlation coefficients by 0, which under default numpy settings doesn't raise an error but only emits the warning invalid value encountered in true_divide; the affected coefficients become nan instead:

import numpy as np

print(np.divide(0,0))

C:\Users\anaconda3\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in true_divide
  """Entry point for launching an IPython kernel.
nan
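
Here is a minimal sketch of the same effect with made-up data: one constant column turns its entire row and column of the correlation matrix into nan:

import numpy as np

# Hypothetical data: the second column is constant, so its standard
# deviation is 0 and every coefficient involving it becomes nan.
X = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [3.0, 5.0]])
print(np.corrcoef(X.T))
# [[ 1. nan]
#  [nan nan]]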

Removing all columns with Series.nunique() == 1 should solve your problem.
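
In pandas that could look like this (a sketch with a hypothetical dataframe):

import pandas as pd

# Keep only columns with more than one unique value, i.e. drop every
# constant column before standardising and calling np.corrcoef().
df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 5, 5]})
df = df.loc[:, df.nunique() > 1]
print(df.columns.tolist())  # ['a']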

Upvotes: 3

Kim O

Reputation: 621

This solved the problem for reasons I cannot explain:

# Import
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)

# Keep track of index / columns to reproduce dataframe
cols = df.columns
index = df.index

# Standardise
X_std = StandardScaler().fit_transform(df.values)
X_std = StandardScaler().fit_transform(X_std)
print(X_std.dtype) # Returns "float64"

# Turn to dataFrame again to drop values easier
df = pd.DataFrame(data=X_std, columns=cols, index=index)

# Delete all zero columns
df = df.loc[:, (df != 0).any(axis=0)]

# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)

Standardising twice in a row works, but it is weird.
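
A plausible explanation (a sketch with made-up data, not verified against my dataset): StandardScaler substitutes scale 1 for a zero-variance column, so a constant column is mapped to all zeros, and the (df != 0).any(axis=0) filter afterwards then removes exactly those constant columns:

import numpy as np
from sklearn.preprocessing import StandardScaler

# The constant second column becomes all zeros after scaling, since
# sklearn uses scale=1 for zero-variance columns.
X = np.array([[1.0, 7.0],
              [2.0, 7.0],
              [3.0, 7.0]])
print(StandardScaler().fit_transform(X))
# [[-1.225  0.   ]
#  [ 0.     0.   ]
#  [ 1.225  0.   ]]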

Upvotes: 0
