Understanding Groups in scikit-learn for XGBoost Ranking

Question

I feel like I'm really missing something obvious when it comes to groups in sklearn data preparation and XGBoost regression parameters.

I've gone over the XGBRanker documentation as well as this tutorial: https://medium.com/predictly-on-tech/learning-to-rank-using-xgboost-83de0166229d

What exactly is a group? Is it an arbitrary chunk of the dataset? The tutorial says groups are important so that you have "A column in your datasets that tells us which datapoints should be compared to what", but my understanding is that sklearn's train_test_split already keeps rows aligned between the features (X) and the labels (y) in both the train and test sets.

My code uses train_test_split() as in a standard data prep process for classification, i.e.:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=X[].values)

What do I have to change to add groups? The tutorial mentions that I can just use query IDs too. Should I just add a column of randomly generated query IDs to the data before splitting?
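For reference, here is a minimal sketch of what a group-aware split might look like, assuming each row already has a meaningful query ID (rows with the same ID are the candidates ranked against each other). The data, the `qid` array, and all sizes below are made up for illustration. It uses scikit-learn's GroupShuffleSplit in place of train_test_split so that every query lands entirely in train or entirely in test, then derives the per-query row counts that a ranker expects as its group sizes.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical data: 6 queries with 5 candidate rows each.
n_queries, rows_per_query = 6, 5
qid = np.repeat(np.arange(n_queries), rows_per_query)  # query ID for every row
X = rng.normal(size=(qid.size, 4))                     # made-up features
y = rng.integers(0, 3, size=qid.size)                  # made-up relevance labels

# Split by query ID so no query is shared between train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=qid))

# Keep rows of the same query contiguous (a stable sort preserves row order).
train_idx = train_idx[np.argsort(qid[train_idx], kind="stable")]
test_idx = test_idx[np.argsort(qid[test_idx], kind="stable")]

X_train, y_train, qid_train = X[train_idx], y[train_idx], qid[train_idx]
X_test, y_test, qid_test = X[test_idx], y[test_idx], qid[test_idx]

# Group sizes: number of rows per query, in the same order as the sorted rows.
_, train_group_sizes = np.unique(qid_train, return_counts=True)
_, test_group_sizes = np.unique(qid_test, return_counts=True)

print("train group sizes:", train_group_sizes)
print("test group sizes:", test_group_sizes)
```

With xgboost installed, the sizes could then be passed as `XGBRanker().fit(X_train, y_train, group=train_group_sizes)`. Under this reading, a group is the set of rows belonging to one query, so randomly generated query IDs would make the ranker compare unrelated rows.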
