Reputation: 51
I first use train_test_split to separate the train and test data; code:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor

X = LOG.iloc[:, :-3]
y = LOG.iloc[:, -3]
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = MinMaxScaler().fit(X)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
for thisalpha in [0.1, 1, 10]:
    mlpreg = MLPRegressor(hidden_layer_sizes=(11, 8, 4),
                          activation="tanh",
                          alpha=thisalpha,
                          solver="lbfgs", max_iter=20000).fit(X_train_scaled, y_train)
    y_test_predict = mlpreg.predict(X_test_scaled)
    y_train_predict = mlpreg.predict(X_train_scaled)
    print("alpha = {}, train score = {:.4f}, test score = {:.4f}, iter_number = {}, loss = {:.4f}".format(
        thisalpha,
        mlpreg.score(X_train_scaled, y_train),
        mlpreg.score(X_test_scaled, y_test),
        mlpreg.n_iter_,
        mlpreg.loss_))
I get performance like this:
alpha = 0.1, train score = 0.7696, test score = 0.7358
alpha = 1, train score = 0.7419, test score = 0.7219
alpha = 10, train score = 0.6414, test score = 0.6494
Then I tried cross-validation on the same dataset, and I get much lower scores:
from sklearn.model_selection import cross_val_score

X = LOG.iloc[:, :-3]
y = LOG.iloc[:, -3]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
clf = MLPRegressor(hidden_layer_sizes=(11, 8, 4), alpha=1,
                   solver="lbfgs", max_iter=20000)
scores = cross_val_score(clf, X_scaled, y, cv=3)
print(scores)
The cross_val_score results are:
[0.04719619 0.36858483 0.36004186]
Upvotes: 0
Views: 254
Reputation: 51
I found where the problem is. My data are actually stored in a "stacked" way: all of class one is at the top, and class n is at the bottom, so the default (unshuffled) folds give weird results. I changed my code so that the data are shuffled first and then passed to cross-validation (a small illustration of the fold behaviour follows after the scores):
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
X_scaled = scaler.fit_transform(X)
clf = MLPRegressor(hidden_layer_sizes=(11, 8, 4), alpha=1,
                   solver="lbfgs", max_iter=20000)
scores = cross_val_score(clf, X_scaled, y, cv=kfold)
print(scores)
Then I get scores like this:
[0.68697805 0.70411961 0.69466066]
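To illustrate why the unshuffled folds behave like this, here is a small sketch with made-up ordered labels (not my real LOG data): without shuffle each test fold is one contiguous block and so contains mostly a single class, while shuffle=True mixes the classes into every fold.

import numpy as np
from sklearn.model_selection import KFold

# Toy target ordered by class, mimicking "stacked" data:
# all of class 1 first, then class 2, then class 3.
y_ordered = np.array([1] * 6 + [2] * 6 + [3] * 6)
X_dummy = np.arange(18).reshape(-1, 1)

# Default KFold (no shuffle): each test fold is a contiguous block.
for train_idx, test_idx in KFold(n_splits=3).split(X_dummy):
    print("unshuffled test fold labels:", y_ordered[test_idx])

# With shuffle=True the classes are spread across every fold.
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X_dummy):
    print("shuffled test fold labels:  ", y_ordered[test_idx])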
Upvotes: 1
Reputation: 12923
Looking at your code, perhaps this is because you left out activation="tanh" when running the cross-validation models. Otherwise, the only real difference I can see is that you're testing on 25% of the data in the first case compared to 33% in the second. That would not impact the accuracy as dramatically as you show.
Note that you should not use the validation/testing set to fit the scaler, since that indirectly exposes the model to your testing data. This is easy to fix in the first case but harder to handle when using cross_val_score.
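One way to handle it with cross_val_score is to wrap the scaler and the regressor in a Pipeline, so the scaler is re-fit on the training folds only inside each split. A rough sketch, assuming X and y are your unscaled features/target from LOG and reusing your hyperparameters:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold, cross_val_score

# The pipeline re-fits MinMaxScaler on the training portion of each fold,
# so the held-out fold never leaks into the scaling.
pipe = make_pipeline(
    MinMaxScaler(),
    MLPRegressor(hidden_layer_sizes=(11, 8, 4), activation="tanh",
                 alpha=1, solver="lbfgs", max_iter=20000))

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=kfold)  # X, y: unscaled data from LOG
print(scores)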
Upvotes: 0