Reputation: 99
I would like to use a multinomial logistic regression to get win probabilities for each of the 5 horses that participate in any given race using each horses previous average speed.
RACE_ID H1_SPEED H2_SPEED H3_SPEED H4_SPEED H5_SPEED WINNING_HORSE
1 40.482081 44.199627 42.034929 39.004813 43.830139 5
2 39.482081 42.199627 41.034929 41.004813 40.830139 4
I am stuck on how to handle the independent variables for each horse given that any of the 5 horses average speed can be placed in any of H1_SPEED
through H5_SPEED
.
Given the fact that for each race I can put any of the 5 horses under H1_SPEED
meaning there is no real relationship between H1_SPEED
from RACE_ID 1
and H1_SPEED
from RACE_ID 2
other than the arbitrary position I selected.
Would there be any difference if the dataset looked like this -
RACE_ID 1
I swapped H3_SPEED
and H5_SPEED
and changed WINNING_HORSE
from 5
to 3
RACE_ID 2
I swapped H4_SPEED
and H1_SPEED
and changed WINNING_HORSE
from 4
to 1
RACE_ID H1_SPEED H2_SPEED H3_SPEED H4_SPEED H5_SPEED WINNING_HORSE
1 40.482081 44.199627 43.830139 39.004813 42.034929 3
2 41.004813 42.199627 41.034929 39.482081 40.830139 1
Is this an issue, if so how should this be handled? What if I wanted to add more independent features per horse?
Upvotes: 1
Views: 164
Reputation: 77
You cannot change in that way your dataset, because each feature (column) has a meaning and probably it depends on the values of the other features. You can imagine it as a six dimensional hyperplane, if you change the value of a feature the position of the point in the hyperplane changes, it does not remain stationary.
If you deem that a feature is useless to solve your problem (i.e. it is independent from the target), you can drop it or avoid to use it during the training phase of your model.
Edit
To solve your specific problem you may add a parameter for each speed column that takes care of the specific horse which is running with that speed. It is a sort of data augmentation, in order to add more problem related features to your model.
RACE_ID H1_SPEED H1_HORSE H2_SPEED H2_HORSE ... WINNING_HORSE
1 40.482081 1 44.199627 2 ... 5
2 39.482081 3 42.199627 5 ... 4
I've invented the number associated to each horse, but it seems that this information is present in your dataset.
Upvotes: 0