beginner_

Reputation: 7622

scikit-learn: Get selected features for prediction data

I have a training set of data. The Python script that creates the model also computes the attributes into a NumPy array (a bit vector). I then use VarianceThreshold to eliminate all features that have zero variance (e.g. all 0 or all 1), and call get_support(indices=True) to get the indices of the selected columns.

My issue now is how to get only the selected features for the data I want to predict. I first calculate all features and then use array indexing, but it does not work:

x_predict_all = getAllFeatures(suppl_predict)
x_predict = x_predict_all[indices] #only selected features

indices is a numpy array.

The returned array x_predict has the correct length, len(x_predict), but the wrong shape: x_predict.shape[1] is still the original number of features. My classifier then throws an error because of the wrong shape:

prediction = gbc.predict(x_predict)

  File "C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 1032, in _init_decision_function
    self.n_features, X.shape[1]))
ValueError: X.shape[1] should be 1855, not 2090.

How can I solve this issue?

Upvotes: 1

Views: 1840

Answers (1)

elyase

Reputation: 40963

You can do it like this:

Test data

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Columns 0 and 3 are constant, so they have zero variance.
X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])
selector = VarianceThreshold()

Alternative 1

>>> selector.fit(X)
>>> idxs = selector.get_support(indices=True)
>>> X[:, idxs]
array([[2, 0],
       [1, 4],
       [1, 1]])
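
The symptoms in the question are consistent with x_predict_all[indices] selecting rows instead of columns: the result has len(indices) rows but keeps every original column. A minimal sketch of the difference, using invented data (the names mirror the question, the values and sizes are made up):

```python
import numpy as np

# Hypothetical stand-in for the question's prediction features: 5 samples, 4 features.
x_predict_all = np.arange(20).reshape(5, 4)
# Column indices, in the form returned by get_support(indices=True).
indices = np.array([1, 2])

rows = x_predict_all[indices]     # picks rows 1 and 2, keeps all 4 columns
cols = x_predict_all[:, indices]  # picks columns 1 and 2 for every sample

print(rows.shape)  # (2, 4)
print(cols.shape)  # (5, 2)
```

So, assuming x_predict_all is a 2-D array, x_predict_all[:, indices] is probably what the question needs.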

Alternative 2

>>> selector.fit_transform(X)
array([[2, 0],
       [1, 4],
       [1, 1]])
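
For the prediction data, the already fitted selector can also be reused through transform, which applies the column mask learned during fit. A sketch under the assumption that the prediction array has the same original columns as the training array (X_train and X_new are illustrative names and values, not from the question):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Illustrative training data; columns 0 and 3 are constant.
X_train = np.array([[0, 2, 0, 3],
                    [0, 1, 4, 3],
                    [0, 1, 1, 3]])
# Illustrative prediction data with the same 4 original features.
X_new = np.array([[1, 5, 6, 7],
                  [0, 8, 9, 3]])

selector = VarianceThreshold()
selector.fit(X_train)                       # learn the column mask from training data only
X_new_selected = selector.transform(X_new)  # keep the same columns in the new data

print(X_new_selected.shape)  # (2, 2)
```

Calling fit_transform on the prediction data instead would recompute the variances there and could keep a different set of columns than the classifier was trained on.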

Upvotes: 3
