Reputation: 99
The data I am using has some null values, and I want to impute them using KNN imputation. To impute effectively, I want to normalize the data first.
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
normalizer.fit_transform(data[num_cols])  # num_cols: the columns with numeric values
Error: Input contains NaN, infinity or a value too large for dtype('float64').
So how do I normalize data that contains NaNs?
Upvotes: 9
Views: 14826
Reputation: 1642
sklearn.preprocessing.Normalizer is not about zero-mean, unit-stdev normalization like the other answers to date. Normalizer() scales rows to unit norm, e.g. to improve clustering, or for the original question's imputation. You can read about the differences here and here. For scaling rows you could try something like this:
import numpy as np

A = np.array([[     7,      4,      5,   7000],
              [     1,    900,      9, np.nan],
              [     5,  -1000, np.nan,    100],
              [np.nan, np.nan,      3,   1000]])
# Compute NaN-aware norms (nansum/nanmax skip NaN entries)
L1_norm = np.nansum(np.abs(A), axis=1)
L2_norm = np.sqrt(np.nansum(A**2, axis=1))
max_norm = np.nanmax(np.abs(A), axis=1)

# Normalize rows; NaN entries stay NaN
A_L1 = A / L1_norm[:, np.newaxis]   # use A.values if A is a DataFrame
A_L2 = A / L2_norm[:, np.newaxis]
A_max = A / max_norm[:, np.newaxis]

# Check that it worked
L1_norm_after = np.nansum(np.abs(A_L1), axis=1)
L2_norm_after = np.sqrt(np.nansum(A_L2**2, axis=1))
max_norm_after = np.nanmax(np.abs(A_max), axis=1)
In[182]: L1_norm_after
Out[182]: array([1., 1., 1., 1.])
In[183]: L2_norm_after
Out[183]: array([1., 1., 1., 1.])
In[184]: max_norm_after
Out[184]: array([1., 1., 1., 1.])
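If your data is in a DataFrame rather than a plain array, the same idea carries over with DataFrame.div. A minimal sketch for the L2 case (the L1 and max norms work the same way; the example frame is just an illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame([[     7,      4,      5,   7000],
                   [     1,    900,      9, np.nan],
                   [     5,  -1000, np.nan,    100],
                   [np.nan, np.nan,      3,   1000]])

# same NaN-aware L2 norm as above, computed on the underlying array
L2_norm = np.sqrt(np.nansum(df.values**2, axis=1))

# divide each row by its norm; NaN entries stay NaN
df_L2 = df.div(L2_norm, axis=0)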
If Google brought you here (like me) and you want to normalize columns to 0 mean, 1 stdev using the estimator API, you can use sklearn.preprocessing.StandardScaler. It can handle NaNs (tested on sklearn 0.20.2; I remember it didn't work on some older versions).
from numpy import nan, nanmean
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
A = [[  7,     4,   5, 7000],
     [  1,   900,   9,  nan],
     [  5, -1000, nan,  100],
     [nan,   nan,   3, 1000]]
scaler.fit(A)
In [45]: scaler.mean_
Out[45]: array([4.33333333, -32., 5.66666667, 2700.])

In [46]: scaler.transform(A)
Out[46]: array([[ 1.06904497,  0.04638641, -0.26726124,  1.40399977],
                [-1.33630621,  1.20089267,  1.33630621,         nan],
                [ 0.26726124, -1.24727908,         nan, -0.84893009],
                [        nan,         nan, -1.06904497, -0.55506968]])

In [54]: nanmean(scaler.transform(A), axis=0)
Out[54]: array([ 1.48029737e-16,  0.00000000e+00, -1.48029737e-16,  0.00000000e+00])
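Since the original question was about KNN imputation, note that the scaled output can be fed straight into sklearn.impute.KNNImputer (available from sklearn 0.22 onward). A sketch, not part of the original answer; n_neighbors is a free choice:

from numpy import nan
from sklearn.impute import KNNImputer          # requires sklearn >= 0.22
from sklearn.preprocessing import StandardScaler

A = [[  7,     4,   5, 7000],
     [  1,   900,   9,  nan],
     [  5, -1000, nan,  100],
     [nan,   nan,   3, 1000]]

scaler = StandardScaler()
A_scaled = scaler.fit_transform(A)             # NaNs pass through untouched

imputer = KNNImputer(n_neighbors=2)
A_imputed = imputer.fit_transform(A_scaled)    # NaNs filled from nearest rows

# undo the scaling to get imputed values back on the original scale
A_restored = scaler.inverse_transform(A_imputed)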
Upvotes: 3
Reputation: 151
This method normalizes all columns to [0, 1], and NaN remains NaN:
def norm_to_zero_one(df):
    return (df - df.min()) * 1.0 / (df.max() - df.min())
For example:
[In]
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, np.nan, 30],
                   'B': [1, np.nan, 10, 5]})
df = df.apply(norm_to_zero_one)
[Out]
A B
0 0.0 0.000000
1 0.5 NaN
2 NaN 1.000000
3 1.0 0.444444
df.max() and df.min() return the max and min of each column.
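This works on data with missing values because pandas reductions skip NaN by default (skipna=True), so the NaN entries don't poison the column minima and maxima. A quick check on the same example frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, np.nan, 30],
                   'B': [1, np.nan, 10, 5]})

df.min()  # A: 10.0, B: 1.0  -- NaN ignored because skipna=True by default
df.max()  # A: 30.0, B: 10.0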
Upvotes: 1
Reputation: 13401
I would suggest not using normalize in sklearn, as it does not deal with NaNs. You can simply use the code below to normalize your data:
df['col']=(df['col']-df['col'].min())/(df['col'].max()-df['col'].min())
The method above ignores NaNs while normalizing the data, since pandas' min() and max() skip NaN by default.
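The same one-liner extends to several columns at once, since pandas broadcasts the per-column min/max. A sketch with a hypothetical example frame; select_dtypes just picks out the numeric columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col':  [10, 20, np.nan, 30],
                   'col2': [1, np.nan, 10, 5]})

num_cols = df.select_dtypes(include='number').columns

# per-column min/max skip NaN, so NaN entries stay NaN after scaling
df[num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())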
Upvotes: 4