Reputation: 19
I don't understand why one has to use the fit_transform
method when the transform
method seems to give the same output as fit_transform. What's the whole point of the fit
method?
I have printed x_train
and x_test
, and both of them gave similar output.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
x_test[:, 3:] = sc.transform(x_test[:, 3:])
Upvotes: 2
Views: 1497
Reputation: 1615
So in scikit-learn preprocessors you always have a fit
, a transform
and a fit_transform
method.
The differences are as follows:
fit
essentially learns
the structure of your data: it finds out the categories that exist in it, the statistics it needs, and other preprocessing information. Once you have fitted your preprocessor, you can then use that fitted preprocessor to transform
your data using that fitted
information. Let's take a simple example:
import numpy as np
from sklearn.preprocessing import StandardScaler
X_train = np.array([[1, 2], [3, 4], [5, 6]])
X_test = np.array([[7, 8], [9, 10]])
X_train:
array([[1, 2],
[3, 4],
[5, 6]])
X_test:
array([[ 7, 8],
[ 9, 10]])
Here you are preparing a standard scaler object
sc = StandardScaler()
This object will have some parameters holding information like the mean of the data and so on. But since it hasn't seen any data yet, that mean value doesn't exist, so the following code is going to raise an error:
print(sc.mean_)
AttributeError: 'StandardScaler' object has no attribute 'mean_'
Now let's use it to fit the X_train data:
sc.fit(X_train)
Let's see what happened after this operation
print(sc.mean_)
[3. 4.]
Now we can see that our standard scaler object has computed the mean of the data it has seen and stored it in one of its attributes, here mean_.
So this is basically the role of the fit
method: it finds parameters about some data, in our case the training data.
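As a side note, mean_ is not the only thing fit stores. A minimal sketch, continuing with the sc fitted above: the learned standard deviation and variance are kept in the scale_ and var_ attributes.
print(sc.scale_)  # [1.63299316 1.63299316] -> per-column standard deviation learned from X_train
print(sc.var_)    # [2.66666667 2.66666667] -> per-column variance learned from X_train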
The reason we want to find those parameters first is that we might want to reuse exactly those parameters to transform other data.
That's where the transform
method comes in.
The transform method uses the 'learned'
parameters from some previous data to transform some new data.
So in our case we can now transform our test data. This is because the train and test data should be transformed the same way (with the same parameters like the mean, etc.):
sc.transform(X_test)
array([[2.44949 , 2.44949 ],
[3.674235, 3.674235]])
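To convince yourself that transform really reuses the training statistics, you can redo the standardization by hand with the fitted attributes (a quick sketch, continuing with the same sc and X_test):
manual = (X_test - sc.mean_) / sc.scale_  # standardize X_test with the training mean and std
print(manual)                             # same values as sc.transform(X_test) above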
But of course we should also transform the training data itself first!
sc.transform(X_train)
array([[-1.224745, -1.224745],
[ 0. , 0. ],
[ 1.224745, 1.224745]])
As you can notice, we have fitted
and then transformed
our training data in a row, while we have only transformed
our test data without needing to fit it.
Fitting and transforming in a row is where the fit_transform
method comes in.
So for the training data we can directly do:
X_train = sc.fit_transform(X_train)
array([[-1.224745, -1.224745],
[ 0. , 0. ],
[ 1.224745, 1.224745]])
This method fits the data and then transforms it. But you can't just transform data without having fit it first.
Now that you have fitted your training data using fit_transform
or just fit
, you can simply transform your test data with the same fitting information as the training data.
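Here is a small sanity check of that whole workflow (a minimal sketch reusing X_train and X_test from above): fit_transform on the training data gives the same result as fit followed by transform, and the test data is only ever transformed.
sc_a = StandardScaler()
sc_b = StandardScaler()

out_a = sc_a.fit_transform(X_train)           # fit and transform in one call
out_b = sc_b.fit(X_train).transform(X_train)  # fit, then transform separately

print(np.allclose(out_a, out_b))                                    # True
print(np.allclose(sc_a.transform(X_test), sc_b.transform(X_test)))  # True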
Hope this was clear enough.
Upvotes: 1
Reputation: 385
What will happen if you do not call sc.fit_transform() before sc.transform()? The latter will fail with the message:
NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
The function fit_transform() does what fit() followed by transform() would do.
You would use fit() alone if you were not interested in the transformed values of the training set.
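For instance (a minimal sketch with made-up numbers, not the code from the question), you might call fit() on the training set only to learn its statistics, and then transform just the new data:
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1, 2], [3, 4], [5, 6]])
X_new = np.array([[7, 8]])

sc = StandardScaler()
sc.fit(X_train)             # learn mean_ and scale_; X_train itself is left unchanged
print(sc.transform(X_new))  # scale the new data with the training statistics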
Upvotes: 2