Jayashree Gowda

Reputation: 99

How to normalize data with NaN values in Python

The data I am using has some null values, and I want to impute them using KNN imputation. In order to impute effectively, I want to normalize the data first.

from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
normalizer.fit_transform(data[num_cols])  # columns with numeric values

Error: Input contains NaN, infinity or a value too large for dtype('float64').

So how do I normalize data that contains NaN values?

Upvotes: 9

Views: 14826

Answers (3)

Tapio

Reputation: 1642

sklearn.preprocessing.Normalizer is not about zero-mean, unit-variance standardization like the other answers to date. Normalizer() scales rows to unit norm, e.g. to improve clustering or the original question's imputation. You can read about the differences here and here. For scaling rows you could try something like this:

import numpy as np
from numpy import nan

A = np.array([[  7,     4,   5,  7000],
              [  1,   900,   9,   nan],
              [  5, -1000, nan,   100],
              [nan,   nan,   3,  1000]])

# Compute NaN-aware norms
L1_norm = np.nansum(np.abs(A), axis=1)
L2_norm = np.sqrt(np.nansum(A**2, axis=1))
max_norm = np.nanmax(np.abs(A), axis=1)

# Normalize rows
A_L1 = A / L1_norm[:, np.newaxis]  # use A.values if A is a DataFrame
A_L2 = A / L2_norm[:, np.newaxis]
A_max = A / max_norm[:, np.newaxis]

# Check that it worked
L1_norm_after = np.nansum(np.abs(A_L1), axis=1)
L2_norm_after = np.sqrt(np.nansum(A_L2**2, axis=1))
max_norm_after = np.nanmax(np.abs(A_max), axis=1)

In [182]: L1_norm_after
Out[182]: array([1., 1., 1., 1.])

In [183]: L2_norm_after
Out[183]: array([1., 1., 1., 1.])

In [184]: max_norm_after
Out[184]: array([1., 1., 1., 1.])

If Google brought you here (like me) and you want to normalize columns to zero mean and unit standard deviation using the estimator API, you can use sklearn.preprocessing.StandardScaler. It can handle NaNs (tested on sklearn 0.20.2; I remember it didn't work on some older versions).

from numpy import nan, nanmean
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

A = [[  7,     4,   5,  7000],
     [  1,   900,   9,   nan],
     [  5, -1000, nan,   100],
     [nan,   nan,   3,  1000]]

scaler.fit(A)

In [45]: scaler.mean_
Out[45]: array([4.33333333,  -32.,    5.66666667, 2700.])

In [46]: scaler.transform(A)
Out[46]: array([[ 1.06904497,  0.04638641, -0.26726124,  1.40399977],
                [-1.33630621,  1.20089267,  1.33630621,         nan],
                [ 0.26726124, -1.24727908,         nan, -0.84893009],
                [        nan,         nan, -1.06904497, -0.55506968]])

In [54]: nanmean(scaler.transform(A), axis=0)
Out[54]: array([ 1.48029737e-16,  0.00000000e+00, -1.48029737e-16,  0.00000000e+00])

Upvotes: 3

jz0410

Reputation: 151

This method normalizes all columns to [0, 1], and NaN values remain NaN:

def norm_to_zero_one(df):
    return (df - df.min()) / (df.max() - df.min())

For example:

[In]
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, np.nan, 30],
                   'B': [1, np.nan, 10, 5]})
df = df.apply(norm_to_zero_one)
[Out]
     A         B
0  0.0  0.000000
1  0.5       NaN
2  NaN  1.000000
3  1.0  0.444444

df.max() and df.min() return the max and min of each column.
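This works because pandas reductions skip NaNs by default (skipna=True), so min and max are computed over the non-missing values only. A quick illustration with a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, np.nan, 30])

print(s.min())              # 10.0 -- NaN is skipped
print(s.max())              # 30.0
print(s.min(skipna=False))  # nan  -- NaN propagates when skipping is disabled
```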

Upvotes: 1

Sociopath

Reputation: 13401

I would suggest not using sklearn's Normalizer, as it does not handle NaNs. You can simply use the code below to normalize your data.

df['col']=(df['col']-df['col'].min())/(df['col'].max()-df['col'].min())

The above method ignores NaNs while normalizing the data.
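To apply the same min-max formula to every numeric column at once rather than one column at a time, something like this should work (the DataFrame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col':  [10.0, 20.0, np.nan, 30.0],
                   'col2': [1.0, 5.0, 3.0, np.nan],
                   'name': ['a', 'b', 'c', 'd']})

# Select only numeric columns, then min-max scale them in one vectorized step;
# pandas skips NaNs in min()/max(), so NaNs stay NaN in the result.
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())

print(df['col'].tolist())  # [0.0, 0.5, nan, 1.0]
```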

Upvotes: 4
