Reputation: 426
I am using scikit-learn for my machine learning work. I followed the steps exactly as described in the official documentation, but I ran into two problems. Here is the main part of the code:
1) trdata is training data created using sklearn's train_test_split. 2) ptest and ntest are the test data for positives and negatives, respectively.
## Preprocessing
scaler = StandardScaler()
scaler.fit(trdata)
trdata = scaler.transform(trdata)
ptest = scaler.transform(ptest)
ntest = scaler.transform(ntest)
## Building Classifier
# setting gamma and C ranges for grid-search optimization of an RBF-kernel SVM classifier
crange = 10.0 ** np.arange(-2, 9)
grange = 10.0 ** np.arange(-5, 4)
pgrid = dict(gamma=grange, C=crange)
cv = StratifiedKFold(y=tg, n_folds=3)
## Threshold Ranging
clf = GridSearchCV(SVC(), param_grid=pgrid, cv=cv, n_jobs=8)
## Training Classifier: Semi Supervised Algorithm
clf.fit(trdata, tg)  # note: n_jobs is a GridSearchCV argument, not a fit() argument
Problem 1) When I set n_jobs = 8 in GridSearchCV, the code runs up to the GridSearchCV call but then hangs (or takes an exceptionally long time without producing a result) at clf.fit, even for a very small dataset. When I remove it, both calls execute, but clf.fit takes very long to converge on large datasets. My data is a 600 x 12 matrix for both positives and negatives. Can you tell me what exactly n_jobs does and how it should be used? Also, is there a faster fitting technique or a modification to the code that would speed it up?
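For reference, here is a minimal sketch of how n_jobs is meant to be used, written against the current sklearn.model_selection API with synthetic data (the shapes and parameter ranges are illustrative, not your actual setup). n_jobs is a GridSearchCV constructor argument that parallelizes the per-candidate cross-validation fits across worker processes; fit() itself accepts no n_jobs argument.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(120, 12)               # synthetic stand-in for trdata
y = rng.randint(0, 2, 120)           # synthetic stand-in for tg

# a smaller grid than in the question, to keep the sketch fast
pgrid = {"C": 10.0 ** np.arange(-2, 3), "gamma": 10.0 ** np.arange(-3, 1)}
cv = StratifiedKFold(n_splits=3)

# n_jobs=2 runs two worker processes; n_jobs=-1 would use all cores.
# It parallelizes the 20 candidates x 3 folds = 60 independent fits.
clf = GridSearchCV(SVC(), param_grid=pgrid, cv=cv, n_jobs=2)
clf.fit(X, y)
print(clf.best_params_)
```

Note that on some platforms multiprocessing requires the script's entry point to be guarded with `if __name__ == "__main__":`, which is a common cause of grid searches hanging when n_jobs > 1.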
Problem 2) Should StandardScaler be fitted on the positive and negative data combined, or separately for each? I suppose it has to be combined, because only then can the same scaler parameters be applied to the test sets.
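The combined approach can be sketched as follows (the names trdata, ptest, and ntest follow the question; the data here is synthetic): fit one StandardScaler on all training rows together, then reuse its learned mean and scale to transform every test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
trdata = rng.randn(600, 12)   # combined training matrix (positives + negatives)
ptest = rng.randn(50, 12)     # positive test examples
ntest = rng.randn(50, 12)     # negative test examples

# learn per-feature mean/std from the training data only
scaler = StandardScaler().fit(trdata)

# apply the SAME fitted parameters to both test sets
ptest_s = scaler.transform(ptest)
ntest_s = scaler.transform(ntest)
```

Fitting separate scalers on positives and negatives would standardize each class with different parameters, which leaks class information into the features and makes the test-time transform ambiguous.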
Upvotes: 3
Views: 735
Reputation: 12689
SVC seems to be very sensitive to data that is not normalized; you may try normalizing the data with:
from sklearn import preprocessing
trdata = preprocessing.scale(trdata)
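One caveat worth noting (my addition, with synthetic data): preprocessing.scale standardizes an array in one shot but does not retain the fitted mean and standard deviation, so the identical transform cannot later be reapplied to test data. StandardScaler computes the same standardization while storing those parameters:

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.RandomState(0)
trdata = rng.randn(100, 12)

# one-off standardization: no parameters are kept
scaled = preprocessing.scale(trdata)

# equivalent standardization, but mean_/scale_ are stored for reuse on test sets
scaler = preprocessing.StandardScaler().fit(trdata)
scaled_via_scaler = scaler.transform(trdata)

# both produce the same result on the training data itself
assert np.allclose(scaled, scaled_via_scaler)
```

For the asker's setup, where two separate test sets must be transformed with the training-data statistics, StandardScaler is the safer choice.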
Upvotes: 3