Reputation: 375
I feel like I'm really missing something obvious when it comes to groups in sklearn data preparation and XGBoost regression parameters.
I've gone over this tutorial: https://medium.com/predictly-on-tech/learning-to-rank-using-xgboost-83de0166229d
as well as the XGBRanker documentation.
What exactly is a group? Is it an arbitrary chunk of the dataset? The tutorial says groups matter because you need "A column in your datasets that tells us which datapoints should be compared to what", but my understanding is that sklearn's train_test_split already keeps rows aligned between the features (X) and the labels (y) in both the train and test sets.
My code uses train_test_split() the way a standard data prep process for classification would, i.e.:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=X[<label column name>].values)
What do I have to change to add groups? The tutorial mentions that I can just use query IDs too. Should I just add a column of randomly generated query IDs to the data before splitting?
Upvotes: 3
Views: 3625
Reputation: 12614
In learning-to-rank, you only care about rankings within each group. This is usually described in the context of search results: each group is the set of matches for a given query. In your linked article, a group is a given race.
If you don't know what your groups are, you might not be in a learning-to-rank situation, and perhaps a more straightforward classification or regression would be better suited.
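If you do have a natural grouping column (a query ID, a race ID), the split has to keep each group whole, and the ranker needs to know how many consecutive rows belong to each group. Here is a minimal sketch of that idea in plain Python. The toy rows, the column layout and the helper names `split_by_group` and `group_sizes` are all made up for illustration; they are not from the tutorial or from any library.

```python
import random
from itertools import groupby

# Hypothetical toy data: each row is (query_id, feature, relevance_label).
# Rows belonging to the same query are stored next to each other.
rows = [
    ("q1", 0.9, 2), ("q1", 0.4, 1), ("q1", 0.1, 0),
    ("q2", 0.7, 1), ("q2", 0.2, 0),
    ("q3", 0.8, 2), ("q3", 0.5, 1), ("q3", 0.3, 0), ("q3", 0.0, 0),
    ("q4", 0.6, 1), ("q4", 0.2, 0),
]


def split_by_group(rows, test_fraction=0.3, seed=0):
    """Send whole groups to train or test, so no query is split across both."""
    query_ids = sorted({qid for qid, _, _ in rows})
    rng = random.Random(seed)
    rng.shuffle(query_ids)
    n_test = max(1, round(len(query_ids) * test_fraction))
    test_ids = set(query_ids[:n_test])
    # Filtering keeps the original row order, so groups stay contiguous.
    train = [r for r in rows if r[0] not in test_ids]
    test = [r for r in rows if r[0] in test_ids]
    return train, test


def group_sizes(rows):
    """Count consecutive rows per query; assumes rows are contiguous by query."""
    return [len(list(g)) for _, g in groupby(rows, key=lambda r: r[0])]


train_rows, test_rows = split_by_group(rows)
train_groups = group_sizes(train_rows)  # one size per query, in row order
test_groups = group_sizes(test_rows)
```

With sklearn, `GroupShuffleSplit` from `sklearn.model_selection` does this kind of group-aware split for you, in place of `train_test_split`. As far as I know, the list of group sizes is what `XGBRanker.fit` expects in its `group` argument, which is how the linked tutorial passes it. Randomly generated query IDs would defeat the purpose: the model would be told to rank rows against each other that have no reason to be compared.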
Upvotes: 2