Reputation: 747
I have a random forest model I built to predict if NFL teams will score more combined points than the line Vegas has set. The features I use are:
- Total - the total number of combined points Vegas thinks both teams will score
- over_percentage - the percentage of public bets on the over
- under_percentage - the percentage of public bets on the under
The over means people are betting that both teams' combined score will be greater than the number Vegas sets, and the under means the combined score will go under the Vegas number. When I run my model I'm getting a confusion_matrix like this and an accuracy_score of 76%. However, the predictions do not perform well. Right now I have it giving me the probability that the classification will be 0. I'm wondering if there are parameters I can tune, or other solutions, to prevent my model from overfitting. I have over 30K games in the training data set, so I don't think lack of data is causing the issue.
Here is the code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

training_data = pd.read_csv(
    '/Users/aus10/NFL/Data/Betting_Data/Training_Data_Betting.csv')
test_data = pd.read_csv(
    '/Users/aus10/NFL/Data/Betting_Data/Test_Data_Betting.csv')

df_model = training_data.dropna()

X = df_model.loc[:, ["Total", "Over_Percentage",
                     "Under_Percentage"]]  # independent columns
y = df_model["Over_Under"]  # target column

results = []

model = RandomForestClassifier(
    random_state=1, n_estimators=500, min_samples_split=2, max_depth=30, min_samples_leaf=1)

n_estimators = [100, 300, 500, 800, 1200]
max_depth = [5, 8, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]

hyperF = dict(n_estimators=n_estimators, max_depth=max_depth,
              min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf)

gridF = GridSearchCV(model, hyperF, cv=3, verbose=1, n_jobs=-1)

model.fit(X, y)

skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)

StratifiedKFold(n_splits=2, random_state=None, shuffle=False)

for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X, X
    y_train, y_test = y, y
    bestF = gridF.fit(X_train, y_train)
    print(bestF.best_params_)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(round(accuracy_score(y_test, y_pred), 2))

index = 0
count = 0

while count < len(test_data):
    team = test_data.loc[index].at['Team']
    total = test_data.loc[index].at['Total']
    over_perc = test_data.loc[index].at['Over_Percentage']
    under_perc = test_data.loc[index].at['Under_Percentage']

    Xnew = [[total, over_perc, under_perc]]
    # make a prediction
    ynew = model.predict_proba(Xnew)
    # show the inputs and predicted outputs
    results.append(
        {
            'Team': team,
            'Over': ynew[0][0]
        })

    index += 1
    count += 1

sorted_results = sorted(results, key=lambda k: k['Over'], reverse=True)

df = pd.DataFrame(sorted_results, columns=['Team', 'Over'])

writer = pd.ExcelWriter('/Users/aus10/NFL/Data/ML_Results/Over_Probability.xlsx',  # pylint: disable=abstract-class-instantiated
                        engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
df.style.set_properties(**{'text-align': 'center'})
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.width', 1000)
writer.save()
And here are links to the Google Docs with the test and training data.
Upvotes: 0
Views: 3501
Reputation: 6799
You are splitting the data using train_test_split with test_size=0.25. The downside to this is that it randomly splits the data and completely ignores the distribution of the classes when doing so. Your model will suffer from sampling bias, where the correct distribution of the data is not maintained across the train and test datasets.
In your train set the data could be skewed more towards a particular instance of the data compared to the test set, and vice versa.
To overcome this you can use StratifiedKFold cross-validation, which maintains the distribution of the classes accordingly.
import pandas as pd
from sklearn import model_selection

def kfold_(file):
    df = pd.read_csv(file)
    df["kfold"] = -1
    df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows
    y = df.target.values  # assumes the target column is named "target"
    kf = model_selection.StratifiedKFold(n_splits=5)
    # assign each row to a fold while preserving the class distribution
    for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
        df.loc[v_, "kfold"] = f
    return df
The following function should then be run for each fold of the dataset that was created by the previous function:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

def run(fold):
    # `file` and `param_grid` are assumed to be defined at module level
    df = pd.read_csv(file)
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    # drop the target and the fold id so neither is used as a feature
    x_train = df_train.drop(["target", "kfold"], axis=1).values
    y_train = df_train.target.values
    x_valid = df_valid.drop(["target", "kfold"], axis=1).values
    y_valid = df_valid.target.values
    rf = RandomForestClassifier()
    grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                               cv=5, n_jobs=-1, verbose=2)
    grid_search.fit(x_train, y_train)
    y_pred = grid_search.predict(x_valid)  # predicts with the refitted best estimator
    print(f"Fold: {fold}")
    print(confusion_matrix(y_valid, y_pred))
    print(classification_report(y_valid, y_pred))
    print(round(accuracy_score(y_valid, y_pred), 2))
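A minimal driver tying the two functions together could look like the sketch below; the path and the grid are placeholders (and the CSV is simply overwritten with the extra kfold column), not recommendations:
file = 'Training_Data_Betting.csv'   # hypothetical path to the training CSV
param_grid = {'n_estimators': [300, 500],
              'max_depth': [5, 8, 15],
              'min_samples_leaf': [1, 5, 10]}

kfold_(file).to_csv(file, index=False)   # add the kfold column used by run()
for fold in range(5):
    run(fold)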
Moreover, you should perform hyperparameter tuning to find the best parameters for your model; the other answer shows you how to do so.
Upvotes: 1
Reputation: 18367
There are a couple of things to note when using random forests. First of all, you might want to use cross_validate in order to measure the performance of your model.
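For instance, a minimal sketch with cross_validate, assuming X and y are built from the training data exactly as in the question (the 5-fold setting here is just an example):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

rf_clf = RandomForestClassifier(random_state=1)
# evaluate on 5 held-out folds instead of on the data the model was fitted on
scores = cross_validate(rf_clf, X, y, cv=5, scoring='accuracy')
print(scores['test_score'].mean(), scores['test_score'].std())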
Furthermore, random forests can be regularized by tweaking the following parameters (a short sketch follows the list):
- max_depth: this parameter controls the maximum depth of the trees. The bigger it is, the more parameters the model will have; remember that overfitting happens when there's an excess of parameters being fitted.
- min_samples_leaf: instead of decreasing max_depth we can increase the minimum number of samples required to be at a leaf node. This limits the growth of the trees too and prevents leaves with very few samples (overfitting!).
- max_features: as previously mentioned, overfitting happens when there's an abundance of parameters being fitted. The number of parameters holds a direct relationship with the number of features in the model, therefore limiting the number of features in each tree will prove valuable in helping to control overfitting.
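For illustration, a sketch of a more constrained forest; the specific values below are only meant to show the knobs, they are not tuned settings:
from sklearn.ensemble import RandomForestClassifier

# shallower trees, bigger leaves and fewer candidate features all act as regularizers
rf_regularized = RandomForestClassifier(
    n_estimators=500,
    max_depth=8,           # cap the depth of each tree
    min_samples_leaf=10,   # require more samples per leaf
    max_features='sqrt',   # limit the features considered at each split
    random_state=1)
rf_regularized.fit(X, y)   # X, y as in the question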
Finally, you might want to try different values and approaches using GridSearchCV to automate trying different combinations:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_clf = RandomForestClassifier()
parameters = {'max_features': np.arange(5, 10),
              'n_estimators': [500, 1000, 1500],
              'max_depth': [2, 4, 8, 16]}
clf = GridSearchCV(rf_clf, parameters, cv=5)
clf.fit(X, y)
This will return a table with the performance of all the different models (given the combinations of hyperparameters), which will allow you to find the best one more easily.
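For example, once clf has been fitted you can inspect that table and the winning combination through the standard GridSearchCV attributes:
import pandas as pd

# one row per hyperparameter combination, with its cross-validated score
results_table = pd.DataFrame(clf.cv_results_)
print(results_table[['params', 'mean_test_score', 'rank_test_score']])

print(clf.best_params_)   # best combination of hyperparameters
print(clf.best_score_)    # its mean cross-validated score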
Upvotes: 2