Reputation: 621
Why would numpy.corrcoef() return NaN values?
I am working with high-dimensional data, so it is infeasible to go through every datum to test values.
# Import
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# Delete all zero columns
df = df.loc[:, (df != 0).any(axis=0)]
# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)
# Standardise
X_std = StandardScaler().fit_transform(df.values)
print(X_std.dtype) # Returns "float64"
# Correlation
cor_mat1 = np.corrcoef(X_std.T)
cor_mat1.max() # Returns nan
When computing cor_mat1 = np.corrcoef(X_std.T), I get this warning:
/Users/kimrants/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py:3183: RuntimeWarning:
invalid value encountered in true_divide
This is a snippet of the input dataframe:
To try and fix it myself, I started by removing all zero columns and all columns that contained any NaN values. I thought this would solve the problem, but it didn't. Am I missing something? I don't see why else it would return NaN values.
My end goal is to compute eigen-values and -vectors.
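For context, once the correlation matrix is free of NaNs, the eigen-decomposition is straightforward with np.linalg.eigh (the routine intended for symmetric matrices). A minimal sketch with random stand-in data, not the asker's dataframe:

```python
import numpy as np

rng = np.random.default_rng(0)
X_std = rng.standard_normal((100, 5))        # stand-in for the standardised data
cor_mat = np.corrcoef(X_std.T)               # 5 x 5 correlation matrix
eig_vals, eig_vecs = np.linalg.eigh(cor_mat) # eigenvalues in ascending order

print(np.isnan(cor_mat).any())               # False: no constant columns here
print(eig_vals.sum())                        # equals the trace, i.e. 5.0
```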
Upvotes: 2
Views: 9728
Reputation: 926
If a column contains the same value in every row, that column's variance is 0. np.corrcoef() then divides that column's correlation coefficients by 0, which under NumPy's default error settings doesn't raise an error but only emits the warning invalid value encountered in true_divide. The affected entries of the correlation matrix become nan instead:
import numpy as np
print(np.divide(0,0))
C:\Users\anaconda3\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in true_divide
"""Entry point for launching an IPython kernel.
nan
Removing all columns with Series.nunique() == 1 should solve your problem.
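That fix can be sketched as follows (df here is a small hypothetical frame with one constant column, not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "b" is constant, so its variance is 0.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                   "b": [5.0, 5.0, 5.0],
                   "c": [2.0, 1.0, 4.0]})

print(np.isnan(np.corrcoef(df.values.T)).any())  # True: "b" poisons the matrix

# Drop every column whose values are all identical, then recompute.
df = df.loc[:, df.nunique() > 1]
print(np.isnan(np.corrcoef(df.values.T)).any())  # False
```

Note that the zero-column and NaN-column filters in the question miss this case: a column that is constant but non-zero passes both checks, yet still has zero variance.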
Upvotes: 3
Reputation: 621
This solved the problem for reasons I cannot explain:
# Import
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)
# Keep track of index / columns to reproduce dataframe
cols = df.columns
index = df.index
# Standardise
X_std = StandardScaler().fit_transform(df.values)
X_std = StandardScaler().fit_transform(X_std)
print(X_std.dtype) # Returns "float64"
# Turn to dataFrame again to drop values easier
df = pd.DataFrame(data=X_std, columns= cols, index=index)
# Delete all zero columns
df = df.loc[:, (df != 0).any(axis=0)]
# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)
Standardising twice in a row works, but it is odd.
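A possible explanation (an assumption, not confirmed in the post): StandardScaler centres each column by its mean and, for zero-variance columns, skips the division, so a single pass already turns every constant column into an all-zero column, which the (df != 0).any(axis=0) filter then drops. The second fit_transform would then be redundant. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical 2-column array: one varying column, one constant column.
X = np.array([[1.0, 7.0],
              [2.0, 7.0],
              [3.0, 7.0]])

X_std = StandardScaler().fit_transform(X)

# The constant column is centred to 0 and, because its variance is 0,
# left unscaled -- so it comes out as an all-zero column that the
# zero-column filter can remove.
print(X_std[:, 1])  # [0. 0. 0.]
```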
Upvotes: 0