Reputation: 7400
I am trying to predict the outcome of tennis matches, just as a fun side project. I'm using a random forest regressor to do this. One of the features is the ranking of the player before a specific match. For many matches I don't have a ranking (I only have the top 200 ranked players). The question is: is it better to put a value that is not an integer, for example the string "NoRank", or an integer that is beyond the range of 1-200? Considering the learning algorithm, I'm inclined to put the value 201, but I would like to hear your opinions on this.
Thanks!
Upvotes: 1
Views: 2659
Reputation: 69
I would be careful with just assigning 201 (or any other value) to the unranked players. A random forest splits on thresholds and does not need normalized data (see "Do I need to normalize (or scale) data for randomForest (R package)?"), so a split may group 200 together with 201, or it may not. You are basically faking data that you do not have.
I would add another column, "haverank", holding a 0/1 value: 0 for players without a rank, 1 for players with a rank.
Call it "highrank" if the name sounds better. You can also add another column named "veryhighrank" and give the value 1 to all players ranked 1-50, and so on.
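A minimal sketch of the indicator-column idea, in plain Python. The helper name `rank_features` and the fill value 201 are assumptions made for illustration; the rank column still needs some number for unranked players, and the extra 0/1 columns let the trees tell a real rank from a filled-in one.

```python
# Hypothetical helper: the names here are illustrative, not from any library.
UNRANKED_FILL = 201  # assumption: one past the last known rank (200)


def rank_features(rank):
    """Return [rank_value, haverank, veryhighrank] for one player.

    `rank` is an int in 1..200, or None when the player is unranked.
    """
    haverank = 0 if rank is None else 1
    rank_value = UNRANKED_FILL if rank is None else rank
    veryhighrank = 1 if rank is not None and rank <= 50 else 0
    return [rank_value, haverank, veryhighrank]


# One feature row per player: ranked 3rd, ranked 120th, unranked.
rows = [rank_features(r) for r in (3, 120, None)]
print(rows)  # [[3, 1, 1], [120, 1, 0], [201, 0, 0]]
```

Rows like these are purely numerical, so they can be stacked with the other match features before fitting the regressor.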
Upvotes: 0
Reputation: 40169
scikit-learn random forests do not support missing values, unfortunately. If you think that unranked players are likely to perform worse than players ranked 200 on average, then imputing the rank 201 makes sense.
Note: all scikit-learn models expect homogeneous numerical input features, not string labels or other Python objects. If you have string labels as features, you first need to find the right feature extraction strategy depending on the meaning of your string features (e.g. categorical variable identifiers, or free text to be extracted as a bag of words).
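As a rough illustration of the categorical case, here is a plain-Python sketch of one-hot encoding. The `one_hot` helper and the court-surface labels are made up for the example and are not part of the question; scikit-learn ships its own `OneHotEncoder` in `sklearn.preprocessing` for the same job.

```python
def one_hot(values):
    """Map each string label to a 0/1 vector, one position per distinct label."""
    categories = sorted(set(values))
    index = {label: i for i, label in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that belongs to this label
        encoded.append(row)
    return categories, encoded


# Hypothetical categorical feature: the surface each match was played on.
categories, encoded = one_hot(["clay", "grass", "clay", "hard"])
print(categories)  # ['clay', 'grass', 'hard']
print(encoded)     # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each label becomes its own 0/1 column, so the model never sees a string and no artificial ordering between labels is implied.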
Upvotes: 2