radio23

Reputation: 99

Multinomial Logistic Regression Predictors Set Up

I would like to use a multinomial logistic regression to get win probabilities for each of the 5 horses that participate in any given race, using each horse's previous average speed.

RACE_ID    H1_SPEED     H2_SPEED    H3_SPEED    H4_SPEED    H5_SPEED    WINNING_HORSE
1          40.482081    44.199627   42.034929   39.004813   43.830139   5
2          39.482081    42.199627   41.034929   41.004813   40.830139   4

I am stuck on how to handle the independent variables for each horse, given that any of the 5 horses' average speeds can be placed in any of H1_SPEED through H5_SPEED.

For each race I can put any of the 5 horses under H1_SPEED, so there is no real relationship between H1_SPEED in RACE_ID 1 and H1_SPEED in RACE_ID 2 other than the arbitrary position I selected.

Would there be any difference if the dataset looked like this:

RACE_ID    H1_SPEED     H2_SPEED    H3_SPEED    H4_SPEED    H5_SPEED    WINNING_HORSE
1          40.482081    44.199627   43.830139   39.004813   42.034929   3
2          41.004813    42.199627   41.034929   39.482081   40.830139   1

Is this an issue? If so, how should it be handled? What if I wanted to add more independent features per horse?

Upvotes: 1

Views: 164

Answers (1)

Andrea

Reputation: 77

You cannot rearrange your dataset that way, because each feature (column) has a meaning, and it probably depends on the values of the other features. You can picture each row as a point in a six-dimensional space: if you change the value of a feature, the point moves to a different position; it does not stay where it was.
If you think a feature is useless for solving your problem (i.e. it is independent of the target), you can drop it or simply leave it out when training your model.
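
As a minimal sketch of the two points above, the snippet below uses plain Python and the race 1 values from the question. The helper names `as_point` and `drop_feature` are made up for this illustration and do not come from any library.

```python
# Race 1 from the question, one value per feature column.
FEATURES = ["H1_SPEED", "H2_SPEED", "H3_SPEED", "H4_SPEED", "H5_SPEED"]
row = {
    "H1_SPEED": 40.482081,
    "H2_SPEED": 44.199627,
    "H3_SPEED": 42.034929,
    "H4_SPEED": 39.004813,
    "H5_SPEED": 43.830139,
}

def as_point(r, features):
    """Return the row as a tuple of coordinates, one per feature (hypothetical helper)."""
    return tuple(r[f] for f in features)

def drop_feature(r, name):
    """Return a copy of the row without the given feature (hypothetical helper)."""
    return {k: v for k, v in r.items() if k != name}

# Swap the values in slots 3 and 5, as in the second table of the question.
swapped = dict(row)
swapped["H3_SPEED"], swapped["H5_SPEED"] = row["H5_SPEED"], row["H3_SPEED"]

# Same set of speeds, but a different point in feature space.
same_point = as_point(row, FEATURES) == as_point(swapped, FEATURES)
same_values = sorted(row.values()) == sorted(swapped.values())

# Dropping a feature you consider irrelevant (H4_SPEED is only an example).
reduced = drop_feature(row, "H4_SPEED")
```

Here `same_point` is false while `same_values` is true, which is why the model sees the two layouts as different inputs.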

Edit

To solve your specific problem, you can add a column next to each speed column that identifies the horse running at that speed. It is a sort of data augmentation that adds more problem-related features to your model.

RACE_ID   H1_SPEED  H1_HORSE   H2_SPEED  H2_HORSE  ... WINNING_HORSE
1         40.482081        1   44.199627        2  ...             5
2         39.482081        3   42.199627        5  ...             4

I've invented the numbers associated with each horse, but it seems that this information is present in your dataset.
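
A rough sketch of how such rows could be built in plain Python is shown below. The speeds are from the question, but the input layout (`races`, `entries`, `winning_slot`), the `build_row` helper, and all horse ids are assumptions made up for this example, so adapt them to whatever your data actually contains.

```python
# Hypothetical input: one dict per race, with a (horse_id, avg_speed) pair per slot.
# Speeds come from the question; the horse ids are invented for illustration only.
races = [
    {
        "race_id": 1,
        "entries": [(1, 40.482081), (2, 44.199627), (3, 42.034929),
                    (4, 39.004813), (5, 43.830139)],
        "winning_slot": 5,
    },
    {
        "race_id": 2,
        "entries": [(3, 39.482081), (5, 42.199627), (1, 41.034929),
                    (2, 41.004813), (4, 40.830139)],
        "winning_slot": 4,
    },
]

def build_row(race):
    """Flatten one race into a row with Hn_SPEED and Hn_HORSE columns per slot."""
    row = {"RACE_ID": race["race_id"]}
    for slot, (horse_id, speed) in enumerate(race["entries"], start=1):
        row[f"H{slot}_SPEED"] = speed
        row[f"H{slot}_HORSE"] = horse_id
    # Kept as in the question: the winner is recorded by slot position.
    row["WINNING_HORSE"] = race["winning_slot"]
    return row

rows = [build_row(race) for race in races]
```

Since a horse id is a label rather than a quantity, you would probably need to encode the `Hn_HORSE` columns as categorical (for example one-hot) before passing them to a regression.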

Upvotes: 0
