Reputation: 1610
In scikit-learn, all estimators have a fit() method, and depending on whether they are supervised or unsupervised, they also have a predict() or transform() method.
I am writing a transformer for an unsupervised learning task and was wondering whether there is a rule of thumb for which kind of learning logic belongs in which method. The official documentation is not very helpful in this regard:
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
In this context, what is meant by both fitting data and transforming data?
Upvotes: 35
Views: 19065
Reputation: 6756
As other answers explain, fit() does not need to do anything (apart from returning the transformer object). It is there so that all transformers share the same interface and work nicely with things like pipelines.
Of course, some transformers (think tf-idf, PCA...) need a fit() method that actually does something.
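To make the contrast concrete, here is a minimal sketch that puts a stateless step and a stateful step in the same pipeline. The choice of FunctionTransformer with np.log1p and the toy data are assumptions made for illustration, not something from the question.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 2.0], [2.0, 4.0], [1.0, 3.0]])

# Stateless step: fit() learns nothing, transform() just applies np.log1p.
log_step = FunctionTransformer(np.log1p)

# Stateful step: fit() has to learn the principal components first.
pca_step = PCA(n_components=1)

# Both steps expose the same fit/transform interface, so they can be chained.
pipe = Pipeline([("log", log_step), ("pca", pca_step)])
X_out = pipe.fit_transform(X)
print(X_out.shape)
```

The pipeline does not need to know which steps actually learn something; it only relies on the shared interface.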
The transform() method needs to return the transformed data.
fit_transform() is a convenience method that chains the fit and transform operations. You can get it for free (!) by deriving your custom transformer class from TransformerMixin and implementing fit() and transform().
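As a rough sketch of what that looks like, the class below (MeanCenterer is a hypothetical name, invented for this illustration) only defines fit() and transform(); fit_transform() comes from TransformerMixin.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that subtracts the per-column training mean."""

    def fit(self, X, y=None):
        # Learning logic goes here: store what was learned on the instance.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # fit() returns the transformer itself

    def transform(self, X):
        # Apply the learned state; nothing is learned in this method.
        return np.asarray(X, dtype=float) - self.mean_


X = [[1, 2], [2, 4], [1, 3]]
centered = MeanCenterer().fit_transform(X)  # inherited from TransformerMixin
print(centered)
```

A reasonable rule of thumb under this design: anything estimated from the data belongs in fit(), and transform() should only use what fit() stored.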
Upvotes: 8
Reputation: 1304
Fitting finds the internal parameters of a model that will be used to transform data. Transforming applies those parameters to data. You may fit a model to one set of data and then use it to transform a completely different set.
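A small sketch of fitting on one set and transforming another, using StandardScaler as an assumed stand-in (it is not mentioned in the question) and made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [2.0], [4.0]])
X_new = np.array([[2.0], [6.0]])

scaler = StandardScaler()
scaler.fit(X_train)  # learns mean_ and scale_ from X_train only
print(scaler.mean_, scaler.scale_)

# transform() reuses the parameters learned above; X_new is never "learned" from.
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```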
For example, you fit a linear model to data to get a slope and intercept. Then you use those parameters to transform (i.e., map) new or existing values of x to y.
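The linear-model analogy can be sketched with plain NumPy. The data points are invented for illustration, and np.polyfit is just one convenient way to get a slope and intercept.

```python
import numpy as np

# Made-up training data that lies exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# "Fit": estimate the parameters (highest-degree coefficient comes first).
slope, intercept = np.polyfit(x, y, deg=1)

# "Transform": map new x values to y using the learned parameters.
x_new = np.array([10.0, 20.0])
y_new = slope * x_new + intercept
print(slope, intercept, y_new)
```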
fit_transform() just does both steps on the same data.
A scikit example: You fit data to find the principal components. Then you transform your data to see how it maps onto these components:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X = [[1, 2], [2, 4], [1, 3]]
pca.fit(X)

# This is the model to map data
pca.components_
# array([[ 0.47185791,  0.88167459],
#        [-0.88167459,  0.47185791]], dtype=float32)

# Now we actually map the data
pca.transform(X)
# array([[-1.03896057, -0.17796634],
#        [ 1.19624651, -0.11592512],
#        [-0.15728599,  0.29389156]])

# Or we can do both "at once"
pca.fit_transform(X)
# array([[-1.03896058, -0.1779664 ],
#        [ 1.19624662, -0.11592512],
#        [-0.15728603,  0.29389152]], dtype=float32)
Upvotes: 53
Reputation: 542
In this case, calling the fit() method does not do anything. As you can see in this example, not all transformers need to actually do something in their fit() or transform() methods. My guess is that every class in scikit-learn should implement fit(), transform(), and/or predict() in order to be consistent with the rest of the package. But I guess this is indeed a bit of overkill.
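One way to see a no-op fit() in practice is a stateless transformer. The sketch below uses Normalizer purely as an assumed stand-in, since the example this answer refers to is not shown here; the arrays are made up.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

A = np.array([[3.0, 4.0], [1.0, 0.0]])
B = np.array([[100.0, 0.0], [5.0, 12.0]])

# Normalizer rescales each row to unit norm, so it has nothing to learn.
out_fit_on_a = Normalizer().fit(A).transform(B)
out_fit_on_b = Normalizer().fit(B).transform(B)

# The data passed to fit() makes no difference to the result.
print(np.allclose(out_fit_on_a, out_fit_on_b))
```

The fit() call is still there so the object can be dropped into a pipeline like any other transformer.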
Upvotes: 3