Reputation: 2411
I have a problem with the fit_transform function. Can someone explain why the resulting arrays have different sizes?
In [5]: X.shape, test.shape
Out[5]: ((1000, 1932), (1000, 1932))
In [6]: from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
features = sel.fit_transform(X)
features_test = sel.fit_transform(test)
In [7]: features.shape, features_test.shape
Out[7]: ((1000, 1663), (1000, 1665))
UPD: Which transformation would give me arrays of the same size?
Upvotes: 3
Views: 3094
Reputation: 6756
It is because you are fitting your selector twice.
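First, note that fit_transform is just a call to fit followed by a call to transform.
The fit method allows your VarianceThreshold selector to find the features it wants to keep in the dataset, based on the parameters you gave it.
The transform method performs the actual feature selection and returns an array with just the selected features.
In your code, the second fit_transform call refits the selector on test, so it selects a different set of columns (1665 instead of 1663). Fit the selector once on X, then call transform on test.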
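As a rough sketch of the fit-once approach: the snippet below fits the selector on X only and reuses it on test. The random binary arrays are made-up stand-ins for your X and test (your real data is not shown in the question), so the exact number of kept columns will differ from yours.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical stand-in data: 50 binary features whose probability of
# being 1 ranges from 0.05 to 0.95, so only some pass the threshold.
rng = np.random.default_rng(0)
p = np.linspace(0.05, 0.95, 50)
X = (rng.random((200, 50)) < p).astype(int)
test = (rng.random((200, 50)) < p).astype(int)

sel = VarianceThreshold(threshold=(.8 * (1 - .8)))

# fit_transform on X: learn which columns to keep, then select them.
features = sel.fit_transform(X)

# transform only on test: reuse the columns chosen from X, no refitting.
features_test = sel.transform(test)

print(features.shape, features_test.shape)
```

Because the column mask is learned once from X, both outputs have the same number of columns. You can inspect which columns were kept with sel.get_support().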
Upvotes: 7
Reputation: 4101
Because fit_transform applies a dimensionality reduction to the array, the resulting arrays do not have the same dimensions as the input.
See the question "what is the difference between 'transform' and 'fit_transform' in sklearn" and the documentation at http://scikit-learn.org/stable/modules/feature_extraction.html
Upvotes: 0