Chunxiao Li

Reputation: 51

Much worse performance when using cross_val_score, why?

I first use train_test_split to split the data into train and test sets:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor

X = LOG.iloc[:, :-3]
y = LOG.iloc[:, -3]
X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = MinMaxScaler().fit(X)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

for thisalpha in [0.1, 1, 10]:
    mlpreg = MLPRegressor(hidden_layer_sizes=(11, 8, 4),
                          activation="tanh",
                          alpha=thisalpha,
                          solver="lbfgs",
                          max_iter=20000).fit(X_train_scaled, y_train)

    print("alpha = {}, train score = {:.4f}, test score = {:.4f}, iter_number = {}, loss = {:.4f}".format(
        thisalpha,
        mlpreg.score(X_train_scaled, y_train),
        mlpreg.score(X_test_scaled, y_test),
        mlpreg.n_iter_,
        mlpreg.loss_))

I get performance like this:

alpha = 0.1, train score = 0.7696, test score = 0.7358

alpha = 1, train score = 0.7419, test score = 0.7219

alpha = 10, train score = 0.6414, test score = 0.6494

Then I tried cross-validation on the same dataset and got much lower scores:

from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

X = LOG.iloc[:, :-3]
y = LOG.iloc[:, -3]

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

clf = MLPRegressor(hidden_layer_sizes=(11, 8, 4), alpha=1,
                   solver="lbfgs", max_iter=20000)
scores = cross_val_score(clf, X_scaled, y, cv=3)

print(scores)

The cross_val_score results are:

[0.04719619 0.36858483 0.36004186]

Upvotes: 0

Views: 254

Answers (2)

Chunxiao Li

Reputation: 51

I found where the problem is. My data are actually stored in a "stacked" order: all of class one at the top, down to class n at the bottom. With unshuffled folds, each fold ends up dominated by one class, which gives weird results. I needed to shuffle the data first and then apply cross-validation:

from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=3, shuffle=True, random_state=0)

X_scaled = scaler.fit_transform(X)
clf = MLPRegressor(hidden_layer_sizes=(11, 8, 4), alpha=1,
                   solver="lbfgs", max_iter=20000)
scores = cross_val_score(clf, X_scaled, y, cv=kfold)

print(scores)

Then I get scores like this:

[0.68697805 0.70411961 0.69466066]

Upvotes: 1

Alex

Reputation: 12923

Looking at your code, perhaps this is because you left out activation="tanh" in the cross-validation model. Otherwise, the only real difference I can see is that you're testing on 25% of the data in the first case versus 33% in the second. That should not impact the accuracy as dramatically as you show.

Note that you should not use the validation/testing set to fit the scaler, since that's exposing the model (indirectly) to your testing data. This is easy to fix in the first case but more difficult to handle when using cross_val_score.
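One common way to handle this is to put the scaler and the model inside a sklearn Pipeline, so that cross_val_score refits the scaler on each training fold only and the held-out fold never leaks into the scaling parameters. A minimal sketch (using make_regression as a stand-in dataset, since the original LOG data isn't available):

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data; replace with X = LOG.iloc[:, :-3], y = LOG.iloc[:, -3]
X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=0)

# The scaler is refit on each training fold inside cross_val_score,
# so the test fold never influences the min/max used for scaling.
pipe = make_pipeline(
    MinMaxScaler(),
    MLPRegressor(hidden_layer_sizes=(11, 8, 4), activation="tanh",
                 alpha=1, solver="lbfgs", max_iter=20000),
)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=kfold)
print(scores)
```

This also handles the shuffling issue from the accepted answer, since the KFold splitter is passed with shuffle=True.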

Upvotes: 0
