jmamath

Reputation: 300

scikit-learn MinMaxScaler doesn't scale

How can I apply scikit-learn's MinMaxScaler to a big array? Let's define the following

import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))

and let's consider the following datasets

Y_train # shape = (2358331,1)
X_train # shape = (2358331,302)

Now I can scale my Y_train using

%%time
Y_train = scaler.fit_transform(Y_train)

it works fine and I get

CPU times: user 36.3 ms, sys: 104 ms, total: 141 ms
Wall time: 388 ms

But when I use the same command on X_train, it takes forever; the execution time does not seem to be linear in the number of columns. So I tried a loop that applies the scaler to each feature separately.

for i in range(X_train.shape[1]):
    scaled_feature = scaler.fit_transform(X_train[:, i].reshape(X_train.shape[0], 1))
    X_train[:, i] = scaled_feature.reshape(X_train.shape[0],)

But that also runs endlessly.
My question is: why does this happen? And is there an alternative approach for this problem?

Upvotes: 2

Views: 2120

Answers (1)

Arya McCarthy

Reputation: 8814

Your problem stems from the fact that you're operating on a huge amount of data.

MinMaxScaler takes a parameter copy, which is True by default. That means it'll make a copy of your data. And your data is huge. Assuming conservatively that every data point is a 32-bit integer, X_train is about 2.8 gigabytes. All of that is getting funneled into memory and copied. You're getting stuck in the copying phase because of thrashing.
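A quick back-of-the-envelope check of that figure, assuming 4 bytes (32 bits) per value and the shape given in the question:

```python
# Approximate in-memory size of X_train, assuming 4 bytes per entry.
rows, cols, bytes_per_value = 2_358_331, 302, 4
print(rows * cols * bytes_per_value / 1e9)  # ~2.85 GB
```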

How do you mitigate this?

  1. Call the constructor with copy=False.
  2. If that’s not enough of an improvement, check out numpy.memmap, which lets you work with large arrays stored on disk as if they were in memory (both options are sketched below).
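A minimal sketch of both options. The copy and feature_range arguments are real MinMaxScaler parameters; the file name, dtype, and shape passed to numpy.memmap are placeholders, and the file is assumed to already contain the raw training data as float32 values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# 1. Skip the internal copy so the array is scaled in place.
scaler = MinMaxScaler(feature_range=(0, 1), copy=False)
X_train = scaler.fit_transform(X_train)

# 2. If RAM is still the bottleneck, keep the array on disk via numpy.memmap.
#    "X_train.dat", the dtype, and the shape are placeholders for your data.
X_mm = np.memmap("X_train.dat", dtype=np.float32, mode="r+",
                 shape=(2358331, 302))
scaler.fit_transform(X_mm)  # with copy=False, the scaling happens in place
X_mm.flush()                # write the scaled values back to the file
```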

Upvotes: 2
