Reputation: 139
In earlier versions of sklearn's MinMaxScaler one could specify the minimum and maximum values based on which the scaler would normalize the data. In other words, the following was possible:
from sklearn import preprocessing
import numpy as np
x_data = np.array([[66,74,89], [1,44,53], [85,86,33], [30,23,80]])
scaler = preprocessing.MinMaxScaler()
scaler.fit ([-90, 90])
b = scaler.transform(x_data)
This would cause the array above to be scaled to the range of (0,1) with the minimum possible value of -90 becoming 0, the maximum possible value of 90 becoming 1 and with all the values in-between getting scaled accordingly. With version 0.21 of sklearn this throws an error:
ValueError: Expected 2D array, got 1D array instead:
array=[-90. 90.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I changed scaler.fit([-90, 90]) to scaler.fit([[-90, 90]]), but then I got:
ValueError: operands could not be broadcast together with shapes (4,3) (2,) (4,3)
I know that I can call scaler.fit(x_data), but this leads to the following result after transform:
[[0.77380952 0.80952381 1.        ]
 [0.         0.33333333 0.35714286]
 [1.         1.         0.        ]
 [0.3452381  0.         0.83928571]]
My issue with that is twofold: 1) The numbers do not seem to be correct. They were supposed to be scaled between 0 and 1, but I get many 0s and many 1s for values that should be higher and lower respectively. 2) What if I want to scale every future array to a range of (0, 1) based on a fixed range of, say, (-90, 90)? This was a convenient feature, but now I have to use a specific array to do my scaling. What is more, the scaling will produce different results every time, because I will have to fit the scaler anew on every future array.
Am I missing something here? Is there a way to keep this nifty feature? And if there isn't, how can I make sure my data is scaled correctly and consistently every time?
Upvotes: 4
Views: 5133
Reputation: 3308
It seems that the problem is not in the scikit-learn package version but in the shape of the input data for the fit() method of the MinMaxScaler object:
import numpy as np
import sklearn
from sklearn.preprocessing import MinMaxScaler
print('scikit-learn package version: {}'.format(sklearn.__version__))
# scikit-learn package version: 0.21.3
scaler = MinMaxScaler()
x_sample = [-90, 90]
scaler.fit(np.array(x_sample)[:, np.newaxis]) # reshape data to satisfy fit() method requirements
x_data = np.array([[66,74,89], [1,44,53], [85,86,33], [30,23,80]])
print(scaler.transform(x_data))
# [[0.86666667 0.91111111 0.99444444]
# [0.50555556 0.74444444 0.79444444]
# [0.97222222 0.97777778 0.68333333]
# [0.66666667 0.62777778 0.94444444]]
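For comparison, the same fixed-range scaling can be written without a scaler at all, since min-max scaling is just (x - min) / (max - min). This is only a minimal sketch, assuming every column shares the same (-90, 90) range; scale_fixed_range is a hypothetical helper name, not part of scikit-learn:

```python
import numpy as np

def scale_fixed_range(x, lo=-90.0, hi=90.0):
    """Map values from the fixed range [lo, hi] onto [0, 1] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return (x - lo) / (hi - lo)

x_data = np.array([[66, 74, 89], [1, 44, 53], [85, 86, 33], [30, 23, 80]])

# The bounds do not depend on the data, so every future array is scaled the same way.
scaled = scale_fixed_range(x_data)
print(scaled)
```

Because nothing is fitted to x_data, the result does not change from one array to the next, which is the consistency the question asks for.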
To learn about the input data requirements of popular preprocessors such as StandardScaler, MinMaxScaler, etc., you can see my answer to another problem with StandardScaler.fit() input.
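One caveat: the snippet above was run on 0.21.3, where a scaler fitted on one feature still broadcasts over a three-column array. As far as I know, newer scikit-learn releases check the number of features in transform() and may reject that call. A hedged sketch that should work either way is to flatten the data to a single column before transforming and restore the shape afterwards:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit on a single feature (one column) that holds only the fixed bounds.
scaler.fit(np.array([[-90.0], [90.0]]))

x_data = np.array([[66, 74, 89], [1, 44, 53], [85, 86, 33], [30, 23, 80]])

# Flatten to one column so the feature count matches the fit, then restore the shape.
scaled = scaler.transform(x_data.reshape(-1, 1)).reshape(x_data.shape)
print(scaled)
```

This treats all values as one feature with a shared (-90, 90) range, which is an assumption taken from the question; columns with different ranges would need one fitted bound pair per column instead.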
Upvotes: 4