Reputation: 300
How can I apply the scikit-learn MinMaxScaler to a big array? So let's define the following:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
and let's consider the following datasets
Y_train # shape = (2358331,1)
X_train # shape = (2358331,302)
Now I can scale my Y_train using
%%time
Y_train = scaler.fit_transform(Y_train)
it works fine and I get
CPU times: user 36.3 ms, sys: 104 ms, total: 141 ms
Wall time: 388 ms
But when I use the same command for X_train, it takes forever; the execution time does not seem to be linear in the number of columns. So I tried using a loop to run the scaler on each feature separately:
for i in range(X_train.shape[1]):
    scaled_feature = scaler.fit_transform(X_train[:, i].reshape(X_train.shape[0], 1))
    X_train[:, i] = scaled_feature.reshape(X_train.shape[0],)
But it's also endless.
My question is: why does this happen? And do you have an alternative approach for this problem?
Upvotes: 2
Views: 2120
Reputation: 8814
Your problem stems from the fact that you're operating on a huge amount of data. MinMaxScaler takes a parameter copy, which is True by default. That means it will make a copy of your data, and your data is huge. Assuming conservatively that every data point is a 32-bit value, X_train is about 2.8 gigabytes. All of that gets funneled into memory and copied, and you're getting stuck in the copying phase because of thrashing (the system repeatedly swapping memory pages to and from disk).
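As a quick back-of-the-envelope check using the shapes from the question:

2358331 rows × 302 columns × 4 bytes ≈ 2.85 × 10^9 bytes ≈ 2.8 GB

and with the default copy=True the scaler needs roughly that much again for the duplicate while it works.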
How do you mitigate this? You have two main options:

- Pass copy=False when you construct the scaler, so the data is scaled in place instead of being duplicated.
- Use numpy.memmap, which lets you work with a large array stored on disk as if it were in memory, as sketched below.
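Here is a minimal sketch combining both ideas. The filename and the raw float32 binary file are assumptions for the example, not part of your setup; the key points are the float dtype (so scikit-learn doesn't need to cast, which would force a copy anyway) and copy=False (so no duplicate is allocated):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

n_rows, n_cols = 2358331, 302  # shapes from the question

# Hypothetical raw binary file holding the training data as float32.
# mode='r+' opens it for in-place modification without loading it all into RAM.
X_train = np.memmap('X_train.dat', dtype=np.float32, mode='r+',
                    shape=(n_rows, n_cols))

# copy=False asks the scaler to transform the array in place. Note this only
# avoids a copy when the input is already a float array; integer data would
# still be cast (and therefore copied) to float.
scaler = MinMaxScaler(feature_range=(0, 1), copy=False)
X_train = scaler.fit_transform(X_train)

If even a single fitting pass is too heavy, MinMaxScaler also exposes partial_fit, so you can compute the per-column min and max chunk by chunk and then transform the chunks the same way.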
Upvotes: 2