Reputation: 2017
I was trying the Random Forest algorithm on the Boston dataset to predict the house prices (medv)
with the help of sklearn's RandomForestRegressor.
Below is my Train/Test split of data:
'''Train Test Split of Data'''
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
Dimensions of Train/Test split
X.shape: (489, 11)
X_train.shape: (366, 11)
X_test.shape: (123, 11)
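The shapes above are what train_test_split produces with its default test_size of 0.25 (the test part is rounded up). A minimal sketch of that arithmetic, assuming random arrays as a stand-in for the Boston features and target, which are not reproduced here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same shape as in the question (489 rows, 11 features)
rng = np.random.RandomState(1)
X = rng.rand(489, 11)
y = rng.rand(489)

# No test_size is given, so the default of 0.25 applies: ceil(0.25 * 489) = 123 test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(X_train.shape, X_test.shape)
```

The remaining 489 - 123 = 366 rows form the training part.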
Below is my tuned random forest model:
#1. import the class/model
from sklearn.ensemble import RandomForestRegressor
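#2. Instantiate the estimator (max_features = 1.0 uses all features, as 'auto' did for regressors; recent scikit-learn releases no longer accept 'auto')
RFReg = RandomForestRegressor(max_features = 1.0, random_state = 1, n_jobs = -1, max_depth = 14, min_samples_split = 2, n_estimators = 550)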
#3. Fit the model with data aka model training
RFReg.fit(X_train, y_train)
#4. Predict the response for a new observation
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)
To evaluate how well the model is performing, I tried sklearn's learning_curve with the code below:
train_sizes = [1, 25, 50, 100, 200, 390] # 390 is 80% of shape(X)
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
def learning_curves(estimator, X, y, train_sizes, cv):
    train_sizes, train_scores, validation_scores = learning_curve(
        estimator, X, y, train_sizes = train_sizes,
        cv = cv, scoring = 'neg_mean_squared_error')
    #print('Training scores:\n\n', train_scores)
    #print('\n', '-' * 70) # separator to make the output easy to read
    #print('\nValidation scores:\n\n', validation_scores)
    # scores are negative MSE, so flip the sign and average over the CV folds
    train_scores_mean = -train_scores.mean(axis = 1)
    print(train_scores_mean)
    validation_scores_mean = -validation_scores.mean(axis = 1)
    print(validation_scores_mean)
    plt.plot(train_sizes, train_scores_mean, label = 'Training error')
    plt.plot(train_sizes, validation_scores_mean, label = 'Validation error')
    plt.ylabel('MSE', fontsize = 14)
    plt.xlabel('Training set size', fontsize = 14)
    title = 'Learning curves for a ' + str(estimator).split('(')[0] + ' model'
    plt.title(title, fontsize = 18, y = 1.03)
    plt.legend()
    plt.ylim(0,40)
Note that I have passed X, y and not X_train, y_train to learning_curve.
I have the following questions regarding learning_curve:
1. Is passing the entire dataset (X, y) instead of only the train subset correct or not?
2. Does the size of the test dataset vary according to the size of the train dataset given in the list train_sizes, or is it always fixed (which would be 25% in my case according to the train/test split, i.e. 123 samples)? For example:
   - train dataset size = 1: will the test data size be 488, or will it be 123 (the size of X_test)?
   - train dataset size = 25: will the test data size be 464, or will it be 123 (the size of X_test)?
   - train dataset size = 50: will the test data size be 439, or will it be 123 (the size of X_test)?
I am a bit confused about the sizes of train/test in the learning_curve function.
Upvotes: 3
Views: 1834
Reputation: 865
You definitely want to use only your training set, so call the function this way. The reason is that you want to see how the learning is happening with the data you are actually using:
# cv=5 is an assumed value; the function defined in the question requires a cv argument
learning_curves(estimator=RFReg, X=X_train, y=y_train, train_sizes=train_sizes, cv=5)
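On the sizes asked about in the question: learning_curve runs its own cross-validation on whatever X, y it receives. For each split the validation part is the held-out fold, and the training subsets of the requested sizes are drawn from the remaining rows, so the validation size does not change with the entries of train_sizes and is unrelated to the earlier train_test_split. The sketch below illustrates this. It assumes 5-fold cross-validation, synthetic data from make_regression with the question's shape, and LinearRegression as a fast stand-in estimator. None of these choices come from the original post.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, learning_curve

# Hypothetical stand-in for the question's data: 489 rows, 11 features
X, y = make_regression(n_samples=489, n_features=11, noise=10.0, random_state=1)

# Assumed CV scheme: plain 5-fold, which is what cv=5 means for a regressor
cv = KFold(n_splits=5)

# (training rows, validation rows) for every split
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in cv.split(X)]
print(fold_sizes)

# Each requested size is a subset of the training rows of a split;
# the validation fold of that split is scored in full every time
train_sizes = [25, 50, 100, 200, 390]
sizes_used, train_scores, validation_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=train_sizes, cv=cv,
    scoring='neg_mean_squared_error')

print(sizes_used)
print(train_scores.shape, validation_scores.shape)  # (n_sizes, n_splits)
```

Under these assumptions every split holds out 97 or 98 rows for validation, whether the training subset has 25 rows or 390. If only X_train (366 rows) is passed with 5-fold cross-validation, at most 292 rows are available for training in a split, so a train_sizes entry of 390 would be rejected with a ValueError.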
Upvotes: -2