Josh

Reputation: 12791

Feature selection with LinearSVC

When I try running the following code with my data (from this example):

X_new = LinearSVC(C=0.01, penalty="l1", dual=False).fit_transform(X, y)

I get:

"Invalid threshold: all features are discarded"

I tried specifying my own threshold:

clf = LinearSVC(C=0.01, penalty="l1", dual=False)
clf.fit(X,y)
X_new = clf.transform(X, threshold=my_threshold)

but I either get:

I can't post the entire matrix X, but below are a few stats of the data:

> X.shape 
Out: (29,312) 

> np.mean(X, axis=1)
Out: 
array([-0.30517191, -0.1147345 ,  0.03674294, -0.15926932, -0.05034101,
       -0.06357734, -0.08781186, -0.12865185,  0.14172452,  0.33640029,
        0.06778798, -0.00217696,  0.09097335, -0.17915627,  0.03701893,
       -0.1361117 ,  0.13132006,  0.14406628, -0.05081956,  0.20777349,
       -0.06028931,  0.03541849, -0.07100492,  0.05740661, -0.38585413,
        0.31837905,  0.14076042,  0.1182338 , -0.06903557])

> np.std(X, axis=1)                                               
Out: 
array([ 1.3267662 ,  0.75313658,  0.81796146,  0.79814621,  0.59175161,
        0.73149726,  0.8087903 ,  0.59901198,  1.13414141,  1.02433752,
        0.99884428,  1.11139231,  0.89254901,  1.92760784,  0.57181158,
        1.01322265,  0.66705546,  0.70248779,  1.17107696,  0.88254386,
        1.06930436,  0.91769016,  0.92915593,  0.84569395,  1.59371779,
        0.71257806,  0.94307434,  0.95083782,  0.88996455])

y = array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
           0, 0, 0, 0, 0, 0])

This is all with scikit-learn 0.14.

Upvotes: 4

Views: 3498

Answers (1)

lejlot

Reputation: 66805

You should first check whether your SVM model is training well before trying to use it as the basis for a transformation. It is possible that your C parameter is too small: with an L1 penalty, a very small C can make sklearn train a trivial model whose coefficients are all zero, which leads to the removal of all features. You can check this either by running classification tests on your data or, at the very least, by printing the learned coefficients (clf.coef_).
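
As a rough illustration of that check, the sketch below counts how many features keep a nonzero coefficient for several values of C. The data is a synthetic stand-in with the same shape as in the question (random noise, not the asker's matrix), the helper name count_nonzero_features is made up for this example, and max_iter is passed only to reduce convergence warnings on recent scikit-learn releases.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-in (assumption): 29 samples, 312 features, 3 classes,
# mimicking the shapes in the question. This is NOT the asker's data.
rng = np.random.RandomState(0)
X = rng.randn(29, 312)
y = np.array([1] * 15 + [2] * 8 + [0] * 6)


def count_nonzero_features(C):
    """Fit an L1-penalized LinearSVC and count the features it keeps."""
    clf = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000)
    clf.fit(X, y)
    # coef_ has shape (n_classes, n_features); a feature survives if
    # at least one class assigns it a nonzero weight.
    return int(np.sum(np.any(clf.coef_ != 0, axis=0)))


for C in (0.001, 0.01, 0.1, 1.0, 10.0):
    print("C=%g -> %d features with nonzero weight" % (C, count_nonzero_features(C)))
```

If the count is zero for the C you are using, the "all features are discarded" error is expected, and increasing C is the first thing to try.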

It would be a good idea to run a grid search for the value of C that generalizes best, and then use that value for the transformation.
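
A minimal sketch of that workflow follows. It assumes a recent scikit-learn release, where GridSearchCV lives in sklearn.model_selection and SelectFromModel is used for the selection step (in 0.14 the grid search class was in sklearn.grid_search and estimators exposed transform directly). The data, the injected class signal and the grid of C values are all placeholders for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic stand-in (assumption): same shapes as in the question.
rng = np.random.RandomState(0)
X = rng.randn(29, 312)
y = np.array([1] * 15 + [2] * 8 + [0] * 6)
# Inject a hypothetical signal: class k shifts feature k, so that
# there is something for the model to learn.
for k in (0, 1, 2):
    X[y == k, k] += 3.0

# Search over C with stratified 3-fold cross-validation.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # placeholder grid
search = GridSearchCV(
    LinearSVC(penalty="l1", dual=False, max_iter=10000),
    param_grid,
    cv=StratifiedKFold(n_splits=3),
)
search.fit(X, y)
print("best C:", search.best_params_["C"])

# Reuse the refitted best model to drop features with (near-)zero weights.
selector = SelectFromModel(search.best_estimator_, prefit=True)
X_new = selector.transform(X)
print("kept %d of %d features" % (X_new.shape[1], X.shape[1]))
```

With only 29 samples the cross-validation scores are noisy, so treat the selected C as a starting point rather than a definitive choice.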

Upvotes: 4
