input for scikit-learn random forest

Question

I am trying to predict the outcome of tennis matches, just as a fun side project. I'm using a random forest regressor to do this. One of the features is the ranking of the player before a specific match. For many matches I don't have a ranking (I only have the top 200 ranked players). The question is: is it better to put in a value that is not an integer, for example the string "NoRank", or an integer that is beyond the range 1-200? Considering the learning algorithm, I'm inclined to use the value 201, but I would like to hear your opinions on this. Thanks!

ogrisel · Accepted Answer

Unfortunately, scikit-learn random forests do not support missing values. If you think that unranked players are likely to perform worse on average than players ranked 200, then imputing the rank 201 makes sense.
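As a minimal sketch of that imputation, assuming the rankings sit in a pandas column where unranked players appear as NaN: the column names and toy numbers below are hypothetical, and the extra `is_unranked` flag is an optional addition of mine, not part of the advice above. It lets the trees separate "really ranked near 200" from "unknown rank".

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: rank is NaN when the player is outside the top 200.
matches = pd.DataFrame({
    "rank": [1.0, 57.0, np.nan, 200.0, np.nan, 12.0],
    "outcome": [1.0, 0.6, 0.1, 0.3, 0.2, 0.9],
})

UNRANKED = 201  # sentinel just past the known range 1-200

X = pd.DataFrame({
    # Replace missing ranks with the sentinel value.
    "rank": matches["rank"].fillna(UNRANKED),
    # Optional (my assumption): flag the rows that were imputed.
    "is_unranked": matches["rank"].isna().astype(int),
})
y = matches["outcome"]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)
predictions = model.predict(X)
```

Because trees split on thresholds, a split such as `rank > 200.5` can isolate the unranked group, so the exact sentinel matters less than keeping it on the "worse" side of the ranked players.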

Note: all scikit-learn models expect homogeneous numerical input features, not string labels or other Python objects. If you have string labels as features, you first need to find the right feature extraction strategy, depending on what the strings mean (e.g. categorical variable identifiers to be encoded, or free text to be extracted as a bag of words).
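For the categorical case, one possible approach is one-hot encoding before the forest. The sketch below assumes a hypothetical string feature `surface` that is not in the question; it only illustrates turning string labels into numerical columns with `OneHotEncoder` and `ColumnTransformer`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "surface" is a string-valued categorical feature.
data = pd.DataFrame({
    "rank": [1, 57, 201, 200, 201, 12],
    "surface": ["clay", "grass", "hard", "clay", "hard", "grass"],
})
target = [1.0, 0.6, 0.1, 0.3, 0.2, 0.9]

# One-hot encode the string column; pass the numeric rank through unchanged.
encode = ColumnTransformer(
    [("surface", OneHotEncoder(handle_unknown="ignore"), ["surface"])],
    remainder="passthrough",
)

pipeline = make_pipeline(
    encode,
    RandomForestRegressor(n_estimators=50, random_state=0),
)
pipeline.fit(data, target)

# Inspect the numerical matrix the forest actually sees.
encoded = pipeline.named_steps["columntransformer"].transform(data)
predictions = pipeline.predict(data)
```

Free-text features would need a different extractor, such as a bag-of-words vectorizer, rather than one-hot encoding.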
