Rookie_123

Reputation: 2017

Learning curve with sklearn

I was trying the Random Forest algorithm on the Boston housing dataset to predict the house prices medv with sklearn's RandomForestRegressor.

Below is my Train/Test split of data:

# Train/test split of the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

Dimensions of the train/test split:

X.shape: (489, 11)
X_train.shape: (366, 11)
X_test.shape: (123, 11)

Below is my tuned random forest model:

#1. import the class/model
from sklearn.ensemble import RandomForestRegressor

#2. Instantiate the estimator
RFReg = RandomForestRegressor(max_features='auto', random_state=1, n_jobs=-1,
                              max_depth=14, min_samples_split=2, n_estimators=550)

#3. Fit the model with data aka model training
RFReg.fit(X_train, y_train)

#4. Predict the response for a new observation
y_pred = RFReg.predict(X_test)


#5. Also predict on the training data to compare train vs. test error
y_pred_train = RFReg.predict(X_train)
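A quick way to sanity-check the fitted model is to compare the train and test MSE directly (a small sketch using sklearn.metrics):

from sklearn.metrics import mean_squared_error

# Compare error on the training data vs. the held-out test data
print('Train MSE:', mean_squared_error(y_train, y_pred_train))
print('Test MSE:', mean_squared_error(y_test, y_pred))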

To evaluate how well the model is performing, I tried sklearn's learning_curve with the code below:

train_sizes = [1, 25, 50, 100, 200, 390] # 390 is roughly 80% of the 489 rows in X

from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt

def learning_curves(estimator, X, y, train_sizes, cv):
    train_sizes, train_scores, validation_scores = learning_curve(
        estimator, X, y, train_sizes=train_sizes,
        cv=cv, scoring='neg_mean_squared_error')
    #print('Training scores:\n\n', train_scores)
    #print('\n', '-' * 70) # separator to make the output easy to read
    #print('\nValidation scores:\n\n', validation_scores)
    train_scores_mean = -train_scores.mean(axis=1)
    print(train_scores_mean)
    validation_scores_mean = -validation_scores.mean(axis=1)
    print(validation_scores_mean)

    plt.plot(train_sizes, train_scores_mean, label='Training error')
    plt.plot(train_sizes, validation_scores_mean, label='Validation error')

    plt.ylabel('MSE', fontsize=14)
    plt.xlabel('Training set size', fontsize=14)
    title = 'Learning curves for a ' + str(estimator).split('(')[0] + ' model'
    plt.title(title, fontsize=18, y=1.03)
    plt.legend()
    plt.ylim(0, 40)
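For reference, the call looks roughly like this (the cv value below is only an illustrative choice, not necessarily the one I used):

learning_curves(estimator=RFReg, X=X, y=y, train_sizes=train_sizes, cv=5)  # cv=5 is only an illustrative value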

Notice that I have passed X and y, not X_train and y_train, to learning_curve.

I have the following questions regarding learning_curve:

  1. I am just not sure whether passing the entire dataset instead of only the train subset is correct.
  2. Does the size of the test set vary according to the size of the train set specified in the list train_sizes, or is it always fixed (which would be 25% in my case according to the train/test split, i.e. 123 samples)? For example:

    • When the train dataset size = 1, will the test set size be 488, or will it be 123 (the size of X_test)?
    • When the train dataset size = 25, will the test set size be 464, or will it be 123 (the size of X_test)?
    • When the train dataset size = 50, will the test set size be 439, or will it be 123 (the size of X_test)?

I am a bit confused about the train/test sizes used inside the learning_curve function.

Upvotes: 3

Views: 1834

Answers (1)

Alessandro

Reputation: 865

You definitely want to use only your training set, so call the function as shown below. The reason is that you want to see how learning progresses on the data you are actually training with:

learning_curves(estimator=RFReg, X=X_train, y=y_train, train_sizes=train_sizes, cv=5)  # cv=5 is an example value; the function requires a cv argument
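To make the sizes concrete, here is a small sketch (the cv value is only an example): learning_curve performs its own K-fold split internally, so for each fold the validation set is the whole held-out fold and its size stays fixed, while only the training subset grows according to train_sizes.

from sklearn.model_selection import learning_curve

# Sketch only: with cv=5 on X_train (366 rows), each CV training fold has
# roughly 292 rows and each validation fold roughly 73-74 rows. The
# validation-fold size stays fixed; train_sizes only controls how many of
# the ~292 training-fold rows are used to fit the model at each step.
sizes_abs, train_scores, val_scores = learning_curve(
    RFReg, X_train, y_train,
    train_sizes=[1, 25, 50, 100, 200],
    cv=5, scoring='neg_mean_squared_error')

print(sizes_abs)         # the absolute training-subset sizes actually used
print(val_scores.shape)  # (n_sizes, n_folds): one score per size per CV fold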

Upvotes: -2
