Vignesh

Reputation: 215

How can I restrict the output of an Amazon Machine Learning model? (Predicting cricket team results)

I am trying to predict the match winner from the historical data set shown below:

data set

The data set comprises IPL seasons, and Team_Name_id and Opponent Team are the IPL team names. I set the match id as the Row ID and created the model. When running real-time testing, the result is not as expected (shown below):

realtime testing

The target is set to Match_winner_id. Am I missing any configuration? Please help.

Upvotes: 0

Views: 209

Answers (1)

John Rotenstein

Reputation: 270274

The model is working correctly. There are just two problems:

  • Your input data is not very good
  • There's no way for the model to know that only one of those two teams should win

Data Quality

A predictive model needs good quality input data on which to reverse-engineer a model that explains a given result. This input data should contain information that can be used to predict a result given a different set of input data.

For example, when predicting house prices, it would need to know the suburb (category), number of bedrooms/bathrooms/parking spaces, age of the building and selling price. It could then predict the selling price for other houses with a slightly different mix of variables.

However, based on your screenshot, you are giving the following information (and probably more) on which to make your prediction:

  • Teams: Not great, because you are separating Column C and Column D. The model will assume they are unrelated information. It doesn't realise that those two values could be swapped.
  • Match date: Useless information unless the outcome varies in proportion to time (e.g. a team continually gets better)
  • Season: As with Match Date, this is probably useless because you're always predicting the future -- you won't be predicting for a past season
  • Venue: Only relevant if a particular team always wins at a given venue
  • Toss Decision: Would this really influence the outcome? Also, it's only known once the game begins, so not great for predicting a future game.
  • Win Type: You won't know the win type until a game is over, so it's not suitable for predicting a future game.
  • Score: Again, not known until the actual game, so no good for future predictions.
  • Man of the Match: Not known for future games.
  • Umpire: How does an umpire influence the result of a game?
  • City: Yes, given that home teams often have an advantage.

You have provided very little information that could be used to predict a future game. There is really only the teams and the venue. Everything else is either part of the game itself or irrelevant.
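As a rough illustration of the point above, the sketch below drops every field that is only known during or after a match before the data reaches the model. The column names and sample values are assumptions based on the question's description, not the real headers from the screenshot.

```python
# Hypothetical column names: only fields known before the match starts.
PRE_MATCH_COLUMNS = {"Team_Name_id", "Opponent_Team_id", "Venue_Name", "City_Name"}


def keep_pre_match_features(row, target="Match_winner_id"):
    """Return only the fields known before the match, plus the target."""
    allowed = PRE_MATCH_COLUMNS | {target}
    return {key: value for key, value in row.items() if key in allowed}


# Made-up example row; the values are placeholders, not real IPL data.
raw_row = {
    "Match_id": 1,
    "Team_Name_id": 2,
    "Opponent_Team_id": 8,
    "Venue_Name": "Example Stadium",
    "City_Name": "Example City",
    "Toss_Decision": "bat",          # only known once the game begins
    "Win_Type": "by runs",           # only known after the game
    "Man_Of_The_Match": "Player X",  # only known after the game
    "Match_winner_id": 8,
}

training_row = keep_pre_match_features(raw_row)
```

Whether Venue_Name and City_Name both belong in the allowed set is a judgment call; the point is only that post-match columns never reach the training data.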

Picking only one of the two teams

When the ML model looks at your data and tries to make a prediction, it will look at all the data you have provided. For example, it might notice that for a given venue and season, Team 8 has a higher propensity to win. Therefore, given that venue and season, it will favour a win by Team 8. The model has no concept that the only possible outcome is one of the two teams given in columns C and D.
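If you keep the multiclass setup anyway, one possible workaround (not something the model does for you, and not a fix for the data problems) is to post-process the per-class scores yourself: keep only the two teams that are actually playing and rescale. The sketch below assumes you already have a dictionary of class scores keyed by team id; the function name and the sample scores are made up for illustration.

```python
def restrict_to_teams(class_scores, team_a, team_b):
    """Keep only the two teams' scores and rescale them to sum to 1."""
    a = class_scores.get(team_a, 0.0)
    b = class_scores.get(team_b, 0.0)
    total = a + b
    if total == 0:
        # The model gave neither team any weight, so there is no preference.
        return {team_a: 0.5, team_b: 0.5}
    return {team_a: a / total, team_b: b / total}


# Hypothetical scores: the model favours team "8", which is not even playing.
scores = {"1": 0.10, "2": 0.15, "5": 0.25, "8": 0.50}

restricted = restrict_to_teams(scores, "1", "2")
predicted_winner = max(restricted, key=restricted.get)
```

This only hides the symptom. The model still has not learned that the winner must be one of the two teams.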

You are predicting for two given teams, but each team can appear in either Column C or Column D. That ordering carries no meaning -- the result is the same if you swap the teams between columns -- but the model has no concept of this. Also, information about Team 1 vs Team 2 is totally irrelevant for Team 3 vs Team 4.

What you should do is create one dataset per team, listing all their matches, plus a column that shows the outcome -- either a boolean (Win/Lose) or a value that represents the number of runs by which they won (where negative is a loss). You would then ask the model to predict the result for that team, given the input data, which would be win/lose or a number of runs above/below the other team.
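A minimal sketch of that reshaping, assuming each match is a dictionary with hypothetical keys match_id, team, opponent, venue and winner (the real column names will differ). Each match becomes two rows, one from each team's point of view, so the column order no longer matters, and the rows can then be filtered per team.

```python
def to_team_rows(match):
    """Turn one match record into two rows, one from each team's perspective."""
    pairs = [
        (match["team"], match["opponent"]),
        (match["opponent"], match["team"]),
    ]
    return [
        {
            "match_id": match["match_id"],
            "team": team,
            "opponent": opponent,
            "venue": match["venue"],
            "won": match["winner"] == team,  # boolean target (Win/Lose)
        }
        for team, opponent in pairs
    ]


def rows_for_team(matches, team):
    """Build the per-team dataset: every match seen from that team's side."""
    return [
        row
        for match in matches
        for row in to_team_rows(match)
        if row["team"] == team
    ]


# Made-up matches for illustration only.
matches = [
    {"match_id": 1, "team": 1, "opponent": 2, "venue": "A", "winner": 2},
    {"match_id": 2, "team": 3, "opponent": 1, "venue": "B", "winner": 1},
    {"match_id": 3, "team": 2, "opponent": 3, "venue": "A", "winner": 3},
]

team_1_rows = rows_for_team(matches, 1)
```

The signed run margin mentioned above could replace the boolean target, but it needs a rule for wins decided by wickets, so it is left out here.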

But at the core, I think that your input data doesn't have enough rich content to be able to make a sensible prediction. Just ask yourself: "What data would I like to know if I were to guess which team would win?" It would probably be past results, weather conditions, which players were on each team, how many matches they played in the last week, etc. None of this information is being provided as input on each line of your input data.
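Some of those inputs can be derived from the match history you already have. The sketch below computes two of them for a team as of a given date: its past win rate and how many matches it played in the previous week. The record layout (date, team, opponent, winner) and the function name are assumptions for illustration; only matches strictly before the given date are used, so nothing leaks from the game being predicted.

```python
from datetime import date, timedelta


def form_features(history, team, match_date, window_days=7):
    """Summarise a team's results strictly before match_date."""
    past = [
        m for m in history
        if m["date"] < match_date and team in (m["team"], m["opponent"])
    ]
    wins = sum(1 for m in past if m["winner"] == team)
    cutoff = match_date - timedelta(days=window_days)
    recent = sum(1 for m in past if m["date"] >= cutoff)
    return {
        "past_matches": len(past),
        "past_win_rate": wins / len(past) if past else None,
        "matches_last_week": recent,
    }


# Made-up history for illustration only.
history = [
    {"date": date(2016, 4, 1), "team": 1, "opponent": 2, "winner": 1},
    {"date": date(2016, 4, 8), "team": 3, "opponent": 1, "winner": 3},
    {"date": date(2016, 4, 10), "team": 1, "opponent": 4, "winner": 1},
    {"date": date(2016, 4, 12), "team": 2, "opponent": 3, "winner": 2},
]

features = form_features(history, team=1, match_date=date(2016, 4, 12))
```

Weather and player line-ups would have to come from another source, since they are not in the match table.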

Upvotes: 1
