Reputation: 2905
I'm trying to normalize data with missing (i.e. nan) values before processing it, using scikit-learn preprocessing.
Apparently, some scalers (e.g. StandardScaler) handle the missing values the way I want - by which I mean normalize the existing values while keeping the nans - while other (e.g. Normalizer) just raise an error.
I've looked around and haven't found - how can I use the Normalizer with missing values, or replicate its behavior (with norm='l1' and norm='l2'; I need to test several normalization options) some other way?
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data.reshape(-1,1))
normalizer = Normalizer(norm='l2')
normalizer.fit_transform(data.reshape(-1,1))
Upvotes: 4
Views: 6384
Reputation: 1809
The problem with your request is that Normalizer operates in this fashion, accordingly to documentation:
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one (source here)
That means that each row have to sum to unit norm. How to deal with a missing value? Ideally it seems you don't want it to count in the sum and you want the row to normalize regardless of it, but the internal function check_array prevents from it by throwing an error.
You need to circumvent such a situation. The most reasonable way to do it is to:
here some code detailing the process, based on your example:
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
# set valid mask
nan_mask = np.isnan(data)
valid_mask = ~nan_mask
normalizer = Normalizer(norm='l2')
# create a result array
result = np.full(data.shape, np.nan)
# assign only valid cases to
result[valid_mask] = normalizer.fit_transform(data[valid_mask].reshape(-1,1)).reshape(data[valid_mask].shape)
Upvotes: 4