Justin Furlotte

Reputation: 21

GridSearchCV Keeps Returning Smallest Alpha for Ridge Regression

I am trying to use data from moneypuck to predict the number of goals an NHL player scored in a given season. I don't expect anyone to look through this but if you want to skim, here is the (rather massive) list of features I kept from the dataset:

['team', 'position', 'games_played', 'icetime', 'shifts', 'gameScore', 'onIce_xGoalsPercentage', 'offIce_xGoalsPercentage', 'onIce_corsiPercentage', 'offIce_corsiPercentage', 'onIce_fenwickPercentage', 'offIce_fenwickPercentage', 'iceTimeRank', 'I_F_xOnGoal', 'I_F_xGoals', 'I_F_xRebounds', 'I_F_xFreeze', 'I_F_xPlayStopped', 'I_F_xPlayContinuedInZone', 'I_F_xPlayContinuedOutsideZone', 'I_F_flurryAdjustedxGoals', 'I_F_scoreVenueAdjustedxGoals', 'I_F_flurryScoreVenueAdjustedxGoals', 'I_F_shotsOnGoal', 'I_F_missedShots', 'I_F_blockedShotAttempts', 'I_F_shotAttempts', 'I_F_rebounds', 'I_F_freeze', 'I_F_playStopped', 'I_F_playContinuedInZone', 'I_F_playContinuedOutsideZone', 'penalties', 'I_F_penalityMinutes', 'I_F_faceOffsWon', 'I_F_hits', 'I_F_takeaways', 'I_F_giveaways', 'I_F_lowDangerShots', 'I_F_mediumDangerShots', 'I_F_highDangerShots', 'I_F_lowDangerxGoals', 'I_F_mediumDangerxGoals', 'I_F_highDangerxGoals', 'I_F_scoreAdjustedShotsAttempts', 'I_F_unblockedShotAttempts', 'I_F_scoreAdjustedUnblockedShotAttempts', 'I_F_dZoneGiveaways', 'I_F_xGoalsFromxReboundsOfShots', 'I_F_xGoalsFromActualReboundsOfShots', 'I_F_reboundxGoals', 'I_F_xGoals_with_earned_rebounds', 'I_F_xGoals_with_earned_rebounds_scoreAdjusted', 'I_F_xGoals_with_earned_rebounds_scoreFlurryAdjusted', 'I_F_shifts', 'I_F_oZoneShiftStarts', 'I_F_dZoneShiftStarts', 'I_F_neutralZoneShiftStarts', 'I_F_flyShiftStarts', 'I_F_oZoneShiftEnds', 'I_F_dZoneShiftEnds', 'I_F_neutralZoneShiftEnds', 'I_F_flyShiftEnds', 'faceoffsWon', 'faceoffsLost', 'timeOnBench', 'penalityMinutes', 'penalityMinutesDrawn', 'penaltiesDrawn', 'shotsBlockedByPlayer', 'OnIce_F_xOnGoal', 'OnIce_F_xGoals', 'OnIce_F_flurryAdjustedxGoals', 'OnIce_F_scoreVenueAdjustedxGoals', 'OnIce_F_flurryScoreVenueAdjustedxGoals', 'OnIce_F_shotsOnGoal', 'OnIce_F_missedShots', 'OnIce_F_blockedShotAttempts', 'OnIce_F_shotAttempts', 'OnIce_F_rebounds', 'OnIce_F_lowDangerShots', 'OnIce_F_mediumDangerShots', 'OnIce_F_highDangerShots', 'OnIce_F_lowDangerxGoals', 'OnIce_F_mediumDangerxGoals', 'OnIce_F_highDangerxGoals', 'OnIce_F_scoreAdjustedShotsAttempts', 'OnIce_F_unblockedShotAttempts', 'OnIce_F_scoreAdjustedUnblockedShotAttempts', 'OnIce_F_xGoalsFromxReboundsOfShots', 'OnIce_F_xGoalsFromActualReboundsOfShots', 'OnIce_F_reboundxGoals', 'OnIce_F_xGoals_with_earned_rebounds', 'OnIce_F_xGoals_with_earned_rebounds_scoreAdjusted', 'OnIce_F_xGoals_with_earned_rebounds_scoreFlurryAdjusted', 'OffIce_F_xGoals', 'OffIce_A_xGoals', 'OffIce_F_shotAttempts', 'OffIce_A_shotAttempts', 'xGoalsForAfterShifts', 'xGoalsAgainstAfterShifts', 'corsiForAfterShifts', 'corsiAgainstAfterShifts', 'fenwickForAfterShifts', 'fenwickAgainstAfterShifts']

(Note: "I_F_" means "individual for" and "OnIce_F_" means "on ice for". The lowercase "x" means "expected".)

I've used one-hot encoding on the team and position features via a column transformer, then built a pipeline that applies standard scaling to all features after the column transformer, followed by Ridge regression. But when I try to tune alpha in my Ridge model using GridSearchCV, the best cross-validation score is always given by the smallest alpha value I test. Stranger still, with alpha=0.01 the model predicts the number of goals scored with a mean absolute error of only 0.07, and if the predictions are rounded to whole numbers it gets the goal total of every player in the test set exactly right.
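Roughly, the setup looks like this (the alpha grid and variable names are placeholders, not my exact code):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# One-hot encode the categorical columns, pass all other features through
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["team", "position"])],
    remainder="passthrough",
    sparse_threshold=0,  # keep the output dense so StandardScaler can center it
)

pipe = Pipeline([
    ("columns", ct),
    ("scale", StandardScaler()),  # scale everything after the column transformer
    ("ridge", Ridge()),
])

# X_train, y_train come from a train/test split of the moneypuck data
param_grid = {"ridge__alpha": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)
print(search.best_params_)  # always the smallest alpha in the grid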

This is obviously nonsense, as the coefficients are enormous (I'm getting numbers like 433835277890.1854), but I can't understand why. I don't see any features in the list that would be "giving away" how many goals the player scored. And just to be 100% sure, I checked the largest coefficients learned by Ridge: the largest positive and largest negative coefficients belonged to OnIce_F_lowDangerShots and OnIce_F_shotsOnGoal, so nothing that directly gives away how many goals the player scored.
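(To check the coefficients I did something along these lines; get_feature_names_out needs a reasonably recent scikit-learn.)

import numpy as np

best_pipe = search.best_estimator_
# The ColumnTransformer knows the expanded feature names; the scaler doesn't change them
feature_names = best_pipe.named_steps["columns"].get_feature_names_out()
coefs = best_pipe.named_steps["ridge"].coef_

order = np.argsort(coefs)
print("most negative:", feature_names[order[:3]], coefs[order[:3]])
print("most positive:", feature_names[order[-3:]], coefs[order[-3:]])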

Upvotes: 2

Views: 401

Answers (1)

from tpot import TPOTClassifier

# Let TPOT search for a good pipeline; the small generation/population sizes keep the run short
tpot = TPOTClassifier(generations=3, population_size=5, verbosity=2,
                      offspring_size=10, scoring='accuracy', cv=5)
tpot.fit(X_train, y_train)

# Score of the best pipeline found, evaluated on the held-out test set
print(tpot.score(X_test, y_test))


generations: number of iterations to run the optimization for
population_size: number of pipelines to keep after each iteration
offspring_size: number of new pipelines to produce in each iteration
mutation_rate: proportion of pipelines that have random changes applied each iteration
crossover_rate: proportion of pipelines that are bred together each iteration
scoring: the metric used to rank candidate pipelines
cv: the cross-validation strategy to use


TPOT can be quite unstable when run with only a few generations, a small population size and few offspring.

Suppose the best classifier TPOT finds is MLPClassifier; then use GridSearchCV to tune its hyperparameters.
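For example (the parameter grid here is only a sketch, not a recommendation):

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50)],
    "alpha": [1e-4, 1e-3, 1e-2],
    "learning_rate_init": [1e-3, 1e-2],
}

# Exhaustively evaluate each combination with 5-fold cross-validation
grid = GridSearchCV(MLPClassifier(max_iter=1000), param_grid, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)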

stepwise refinement

Another way to tune your parameters is stepwise refinement: try parameter combinations iteratively, see which combination improves accuracy, and plot the results.
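A minimal sketch of that idea, varying one parameter at a time and plotting the mean cross-validated accuracy (the parameter and value range are just examples):

import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

alphas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
scores = []
for a in alphas:
    model = MLPClassifier(alpha=a, max_iter=1000)
    scores.append(cross_val_score(model, X_train, y_train, scoring="accuracy", cv=5).mean())

plt.plot(alphas, scores, marker="o")
plt.xscale("log")
plt.xlabel("alpha")
plt.ylabel("mean CV accuracy")
plt.show()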

preprocessing

You should also normalize your data with StandardScaler and handle NaNs with SimpleImputer. To find out whether any features can be removed, fit a logistic regression and look at how much weight each feature receives.
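For instance (a minimal sketch; it assumes X_train is a DataFrame whose categorical columns have already been encoded):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaNs
    ("scale", StandardScaler()),                   # put features on a comparable scale
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Average absolute weight per feature across classes; small weights suggest
# the feature contributes little and might be dropped
weights = np.abs(pipe.named_steps["clf"].coef_).mean(axis=0)
for name, w in sorted(zip(X_train.columns, weights), key=lambda t: t[1]):
    print(name, round(w, 4))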

Upvotes: 1
