Toby

Reputation: 2294

Lightgbm ranking example

Can anyone share a minimal example with data for how to train a ranking model with LightGBM? Preferably with the scikit-learn API? What I am struggling with is how to pass the label data. My data are page impressions and look like this:

X:
user1, feature1, ...
user2, feature1, ...

y:
user1, page1, 10 impressions
user1, page2, 6 impressions
user2, page1, 9 impressions

So far I think I have figured out that

But how do I give the information that for user1, page1 has been visited more often than page2?

Upvotes: 9

Views: 4357

Answers (1)

charelf

Reputation: 3825

Here is how I used LightGBM LambdaRank.

First we import some libraries and define our dataset:

import numpy as np
import pandas as pd
import lightgbm

df = pd.DataFrame({
    "query_id":[i for i in range(100) for j in range(10)],
    "var1":np.random.random(size=(1000,)),
    "var2":np.random.random(size=(1000,)),
    "var3":np.random.random(size=(1000,)),
    "relevance":list(np.random.permutation([0,0,0,0,0, 0,0,0,1,1]))*100
})

Here is the dataframe:

     query_id      var1      var2      var3  relevance
0           0  0.624776  0.191463  0.598358          0
1           0  0.258280  0.658307  0.148386          0
2           0  0.893683  0.059482  0.340426          0
3           0  0.879514  0.526022  0.712648          1
4           0  0.188580  0.279471  0.062942          0
..        ...       ...       ...       ...        ...
995        99  0.509672  0.552873  0.166913          0
996        99  0.244307  0.356738  0.925570          0
997        99  0.827925  0.827747  0.695029          1
998        99  0.476761  0.390823  0.670150          0
999        99  0.241392  0.944994  0.671594          0

[1000 rows x 5 columns]

The structure of this dataset is important. In learning-to-rank tasks, you usually work with a set of queries. Here I define a dataset of 1000 rows, with 100 queries of 10 rows each. The queries could also be of variable length.

Now for each query, we have some variables and also a relevance label. I used only 0 and 1 here, so the task is: for each query (set of 10 rows), train a model that scores the 2 rows with relevance 1 higher than the other 8.
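For data like the impression counts in the question, one possible way to get such a relevance label is to treat each user as a query and rank that user's pages by impressions. This is only a sketch under assumptions: the column names are made up, and dense ranking is just one of several reasonable encodings.

```python
import pandas as pd

# Hypothetical impression log in the shape described in the question.
impressions = pd.DataFrame({
    "user": ["user1", "user1", "user2"],
    "page": ["page1", "page2", "page1"],
    "impressions": [10, 6, 9],
})

# Dense rank within each user: the least-viewed page gets 0, the next 1, ...
# lambdarank expects integer labels, and with LightGBM's default
# label_gain they have to stay below 31.
impressions["relevance"] = (
    impressions.groupby("user")["impressions"]
    .rank(method="dense")
    .astype(int)
    - 1
)
print(impressions)
```

With this encoding, page1 gets a higher label than page2 for user1, which is the "visited more often" information. A user with a single page (user2 here) forms a query of length 1 and gives the ranker nothing to compare.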

Anyway, we continue with the setup for LightGBM. I split the dataset into a training set and validation set, but you can do whatever you want. I would recommend using at least 1 validation set during training.

train_df = df[:800]  # first 80%
validation_df = df[800:]  # remaining 20%

qids_train = train_df.groupby("query_id")["query_id"].count().to_numpy()
X_train = train_df.drop(["query_id", "relevance"], axis=1)
y_train = train_df["relevance"]

qids_validation = validation_df.groupby("query_id")["query_id"].count().to_numpy()
X_validation = validation_df.drop(["query_id", "relevance"], axis=1)
y_validation = validation_df["relevance"]
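The positional slice above keeps every query intact only because each query occupies 10 contiguous rows and 800 is a multiple of 10. If your data is not laid out like that, a split by query id could look roughly like this (the small `demo` frame and the variable names are placeholders, not part of the example above):

```python
import numpy as np
import pandas as pd

# Small stand-in frame: 5 queries with 4 rows each.
demo = pd.DataFrame({
    "query_id": np.repeat(np.arange(5), 4),
    "var1": np.random.default_rng(0).random(20),
})

# Pick whole queries for validation, so no query is split across the sets.
rng = np.random.default_rng(42)
query_ids = demo["query_id"].unique()
val_ids = rng.choice(query_ids, size=1, replace=False)

is_val = demo["query_id"].isin(val_ids)
demo_train = demo[~is_val]
demo_val = demo[is_val]
```

Boolean masking keeps the original row order, so rows of the same query stay contiguous in both parts.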

Now this is probably the thing you were stuck at. We create these 3 vectors/matrices for each dataframe. X_train is the collection of your independent variables, i.e. the input data for your model. y_train is your dependent variable, what you are trying to predict/rank. Lastly, qids_train describes your queries: despite the name it does not hold query ids but the number of rows in each query, in row order. It looks like this:

array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10])

Also this is X_train:

         var1      var2      var3
0    0.624776  0.191463  0.598358
1    0.258280  0.658307  0.148386
2    0.893683  0.059482  0.340426
3    0.879514  0.526022  0.712648
4    0.188580  0.279471  0.062942
..        ...       ...       ...
795  0.014315  0.302233  0.255395
796  0.247962  0.871073  0.838955
797  0.605306  0.396659  0.940086
798  0.904734  0.623580  0.577026
799  0.745451  0.951092  0.861373

[800 rows x 3 columns]

and this is y_train:

0      0
1      0
2      0
3      1
4      0
      ..
795    0
796    0
797    1
798    0
799    0
Name: relevance, Length: 800, dtype: int64

Note that X_train is a pandas DataFrame and y_train a pandas Series. LightGBM supports both, but numpy arrays would also work.

As you can see from qids_train, its entries give the length of each query. If your queries were of variable length, the numbers in that array would differ. In my example, all queries have the same length.
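To make the variable-length case concrete, here is a small sketch with a made-up frame whose queries have 3, 1 and 2 rows. It assumes the rows of each query are already contiguous. The `sort=False` is my addition, so that the sizes follow the row order rather than the sorted query ids.

```python
import pandas as pd

# Hypothetical frame with queries of length 3, 1 and 2; rows of the
# same query are contiguous, but the ids are not in sorted order.
demo = pd.DataFrame({
    "query_id": ["q7", "q7", "q7", "q2", "q5", "q5"],
    "relevance": [1, 0, 0, 1, 0, 1],
})

# sort=False keeps the groups in order of first appearance, so the sizes
# line up with the rows instead of with the sorted query ids.
group_sizes = demo.groupby("query_id", sort=False)["query_id"].count().to_numpy()
print(group_sizes)  # [3 1 2]
```

The sizes must sum to the number of rows, since LightGBM reads them as "the first 3 rows are one query, the next 1 row is another, and so on".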

We do the exact same thing for the validation set, and then we are ready to start the LightGBM model setup and training. I use the scikit-learn API since I am familiar with that one.

model = lightgbm.LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
)

I only use the bare minimum of parameters here. Feel free to take a look at the LightGBM documentation and use more parameters; it is a very powerful library. To start the training process, we call the fit function on the model. Here we specify that we want NDCG@10 and that the results should be printed every 10th iteration.

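model.fit(
    X=X_train,
    y=y_train,
    group=qids_train,
    eval_set=[(X_validation, y_validation)],
    eval_group=[qids_validation],
    eval_at=[10],
    # LightGBM >= 4.0 no longer accepts verbose=10 in fit();
    # on older versions you can pass verbose=10 instead of the callback.
    callbacks=[lightgbm.log_evaluation(period=10)],
)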

which starts the training and prints:

[10]    valid_0's ndcg@10: 0.562929
[20]    valid_0's ndcg@10: 0.55375
[30]    valid_0's ndcg@10: 0.538355
[40]    valid_0's ndcg@10: 0.548532
[50]    valid_0's ndcg@10: 0.549039
[60]    valid_0's ndcg@10: 0.546288
[70]    valid_0's ndcg@10: 0.547836
[80]    valid_0's ndcg@10: 0.552541
[90]    valid_0's ndcg@10: 0.551994
[100]   valid_0's ndcg@10: 0.542401
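In case the metric itself is unclear: below is a rough sketch of NDCG@k for a single query, using the common formulation with gain 2^label - 1 and a log2 position discount. The function names are mine, and this is not LightGBM's own implementation, which may differ in details such as tie handling or queries without relevant rows. In practice the scores would come from something like model.predict(X_validation), taken for the rows of one query. Keep in mind that the features in this toy dataset are random noise, so there is nothing for the model to learn and the numbers above are not expected to improve.

```python
import numpy as np

def dcg_at_k(labels_in_ranked_order, k):
    """DCG over the top-k rows: gain 2^label - 1, discounted by log2 of the position."""
    labels = np.asarray(labels_in_ranked_order, dtype=float)[:k]
    gains = 2.0 ** labels - 1.0
    discounts = np.log2(np.arange(2, labels.size + 2))  # positions 1..k -> log2(2..k+1)
    return float(np.sum(gains / discounts))

def ndcg_at_k(labels, scores, k):
    """NDCG@k for one query: DCG of the predicted order divided by DCG of the ideal order."""
    labels = np.asarray(labels)
    predicted_order = np.argsort(scores)[::-1]  # indices, highest score first
    ideal_labels = np.sort(labels)[::-1]        # labels, highest first
    ideal = dcg_at_k(ideal_labels, k)
    if ideal == 0.0:
        return 0.0  # no relevant rows; returning 0 is a convention of this sketch
    return dcg_at_k(labels[predicted_order], k) / ideal

# Toy query with 3 rows: the model scores the irrelevant row highest.
labels = [1, 0, 1]
scores = [0.2, 0.9, 0.5]
print(ndcg_at_k(labels, scores, k=3))
```

A perfect ordering gives 1.0. The value printed during training is an aggregate of this per-query number over all validation queries.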

I hope I could sufficiently illustrate the process with this simple example. Let me know if you have any questions left.

Upvotes: 14
