Reputation: 747
I have a random forest model I built to predict if NFL teams will score more combined points than the line Vegas has set. The features I use are:
- Total - the total number of combined points Vegas thinks both teams will score
- over_percentage - the percentage of public bets on the over
- under_percentage - the percentage of public bets on the under
The over means people are betting that both teams' combined score will be greater than the number Vegas sets, and the under means the combined score will go under the Vegas number. When I run my model I'm getting a confusion_matrix like this and an accuracy_score of 76%. However, the predictions do not perform well. Right now I have it giving me the probability that the classification will be 0. I'm wondering if there are parameters I can tune, or other solutions, to prevent my model from overfitting. I have over 30K games in the training data set, so I don't think lack of data is causing the issue.
Here is the code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

training_data = pd.read_csv(
    '/Users/aus10/NFL/Data/Betting_Data/Training_Data_Betting.csv')
test_data = pd.read_csv(
    '/Users/aus10/NFL/Data/Betting_Data/Test_Data_Betting.csv')

df_model = training_data.dropna()

X = df_model.loc[:, ["Total", "Over_Percentage",
                     "Under_Percentage"]]  # independent columns
y = df_model["Over_Under"]  # target column

results = []

model = RandomForestClassifier(
    random_state=1, n_estimators=500, min_samples_split=2, max_depth=30, min_samples_leaf=1)

n_estimators = [100, 300, 500, 800, 1200]
max_depth = [5, 8, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]

hyperF = dict(n_estimators=n_estimators, max_depth=max_depth,
              min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf)

gridF = GridSearchCV(model, hyperF, cv=3, verbose=1, n_jobs=-1)

model.fit(X, y)

skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)

StratifiedKFold(n_splits=2, random_state=None, shuffle=False)

for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X, X
    y_train, y_test = y, y
    bestF = gridF.fit(X_train, y_train)
    print(bestF.best_params_)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(round(accuracy_score(y_test, y_pred), 2))

index = 0
count = 0

while count < len(test_data):
    team = test_data.loc[index].at['Team']
    total = test_data.loc[index].at['Total']
    over_perc = test_data.loc[index].at['Over_Percentage']
    under_perc = test_data.loc[index].at['Under_Percentage']

    Xnew = [[total, over_perc, under_perc]]
    # make a prediction
    ynew = model.predict_proba(Xnew)
    # show the inputs and predicted outputs
    results.append(
        {
            'Team': team,
            'Over': ynew[0][0]
        })

    index += 1
    count += 1

sorted_results = sorted(results, key=lambda k: k['Over'], reverse=True)

df = pd.DataFrame(sorted_results, columns=['Team', 'Over'])

writer = pd.ExcelWriter('/Users/aus10/NFL/Data/ML_Results/Over_Probability.xlsx',  # pylint: disable=abstract-class-instantiated
                        engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
df.style.set_properties(**{'text-align': 'center'})
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.width', 1000)
writer.save()
And here are links to the Google Docs with the test and training data.
Upvotes: 0
Views: 3501
Reputation: 6799
You are splitting the data using train_test_split with test_size=0.25. The downside to this is that it randomly splits the data and completely ignores the distribution of the classes when doing so. Your model will suffer from sampling bias, where the correct distribution of the data is not maintained across the train and test datasets.
In your train set the data could be skewed more towards a particular instance of the data compared to the test set, and vice versa.
To overcome this you can use StratifiedKFold cross-validation, which maintains the distribution of the classes accordingly.
import pandas as pd
from sklearn import model_selection

def kfold_(file):
    df = pd.read_csv(file)
    df["kfold"] = -1
    df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows
    y = df.target.values  # assumes the target column is named "target"
    kf = model_selection.StratifiedKFold(n_splits=5)
    # assign each row to a fold while preserving the class distribution
    for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
        df.loc[v_, "kfold"] = f
    return df
The following function should then be run for each fold of the dataset that was created by the previous function:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

def run(fold):
    # `file` and `param_grid` are assumed to be defined at module level
    df = pd.read_csv(file)
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    # drop the target and the fold id so neither is used as a feature
    x_train = df_train.drop(["target", "kfold"], axis=1).values
    y_train = df_train.target.values
    x_valid = df_valid.drop(["target", "kfold"], axis=1).values
    y_valid = df_valid.target.values
    rf = RandomForestClassifier()
    grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                               cv=5, n_jobs=-1, verbose=2)
    grid_search.fit(x_train, y_train)
    y_pred = grid_search.predict(x_valid)  # predicts with the refitted best estimator
    print(f"Fold: {fold}")
    print(confusion_matrix(y_valid, y_pred))
    print(classification_report(y_valid, y_pred))
    print(round(accuracy_score(y_valid, y_pred), 2))
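A minimal driver tying the two functions together could look like the sketch below; the path and the grid are placeholders (and the CSV is simply overwritten with the extra kfold column), not recommendations:
file = 'Training_Data_Betting.csv'   # hypothetical path to the training CSV
param_grid = {'n_estimators': [300, 500],
              'max_depth': [5, 8, 15],
              'min_samples_leaf': [1, 5, 10]}

kfold_(file).to_csv(file, index=False)   # add the kfold column used by run()
for fold in range(5):
    run(fold)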
Moreover, you should perform hyperparameter tuning to find the best parameters for your model; the other answer shows you how to do so.
Upvotes: 1
Reputation: 18367
There are a couple of things to note when using random forests. First of all, you might want to use cross_validate in order to measure the performance of your model.
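For instance, a minimal sketch with cross_validate, assuming X and y are built from the training data exactly as in the question (the 5-fold setting here is just an example):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

rf_clf = RandomForestClassifier(random_state=1)
# evaluate on 5 held-out folds instead of on the data the model was fitted on
scores = cross_validate(rf_clf, X, y, cv=5, scoring='accuracy')
print(scores['test_score'].mean(), scores['test_score'].std())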
Furthermore, random forests can be regularized by tweaking the following parameters (a short sketch follows the list):
- max_depth: this parameter controls the maximum depth of the trees. The bigger it is, the more parameters the model will have; remember that overfitting happens when there's an excess of parameters being fitted.
- min_samples_leaf: instead of decreasing max_depth we can increase the minimum number of samples required to be at a leaf node. This limits the growth of the trees too and prevents leaves with very few samples (overfitting!).
- max_features: as previously mentioned, overfitting happens when there's an abundance of parameters being fitted. The number of parameters holds a direct relationship with the number of features in the model, therefore limiting the number of features in each tree will prove valuable in helping to control overfitting.
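For illustration, a sketch of a more constrained forest; the specific values below are only meant to show the knobs, they are not tuned settings:
from sklearn.ensemble import RandomForestClassifier

# shallower trees, bigger leaves and fewer candidate features all act as regularizers
rf_regularized = RandomForestClassifier(
    n_estimators=500,
    max_depth=8,           # cap the depth of each tree
    min_samples_leaf=10,   # require more samples per leaf
    max_features='sqrt',   # limit the features considered at each split
    random_state=1)
rf_regularized.fit(X, y)   # X, y as in the question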
Finally, you might want to try different values and approaches using GridSearchCV to automate trying different combinations:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_clf = RandomForestClassifier()
parameters = {'max_features': np.arange(5, 10),
              'n_estimators': [500, 1000, 1500],
              'max_depth': [2, 4, 8, 16]}
clf = GridSearchCV(rf_clf, parameters, cv=5)
clf.fit(X, y)
This will return a table with the performance of all the different models (given the combinations of hyperparameters), which will allow you to find the best one more easily.
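For example, once clf has been fitted you can inspect that table and the winning combination through the standard GridSearchCV attributes:
import pandas as pd

# one row per hyperparameter combination, with its cross-validated score
results_table = pd.DataFrame(clf.cv_results_)
print(results_table[['params', 'mean_test_score', 'rank_test_score']])

print(clf.best_params_)   # best combination of hyperparameters
print(clf.best_score_)    # its mean cross-validated score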
Upvotes: 2