Reputation: 6080
My data has about 70 classes, and I am using LightGBM to predict the correct class label.
In R, I would like a custom "metric" function that evaluates whether the top 3 predictions from LightGBM include the true label.
The example at the link here is a helpful starting point:
    import numpy as np
    from sklearn.metrics import f1_score

    def lgb_f1_score(y_hat, data):
        y_true = data.get_label()
        y_hat = np.round(y_hat)  # scikit-learn's f1_score doesn't accept probabilities
        return 'f1', f1_score(y_true, y_hat), True
However, I don't know the dimensionality of the arguments passed to the function, and the data seem to be shuffled for some reason.
Upvotes: 4
Views: 2673
Reputation: 1310
After reading through the docs for lgb.train and lgb.cv, I had to write a separate function, get_ith_pred, and then call it repeatedly within lgb_f1_score.
The function's docstring explains how it works. I have used the same argument names as in the LightGBM docs. This works for any number of classes greater than 2 but not for binary classification, because in the binary case preds is a 1D array containing only the probability of the positive class.
    import numpy as np
    from sklearn.metrics import f1_score

    def get_ith_pred(preds, i, num_data, num_class):
        """
        preds: 1D NumPy array
            A 1D numpy array containing predicted probabilities. Has shape
            (num_data * num_class,). So, for a 3-class problem with
            100 rows of data in your training set, preds has shape (300,),
            i.e. (100 * 3,).
        i: int
            The zero-based index of the row/sample in your training data
            you wish to calculate the prediction for.
        num_data: int
            The number of rows/samples in your training data.
        num_class: int
            The number of classes in your classification task.
            Must be greater than 2.

        LightGBM docs tell us that to get the probability of class 0 for
        the row at index 5 of the dataset we do preds[0 * num_data + 5].
        For the class 1 prediction of the row at index 7, do
        preds[1 * num_data + 7].

        sklearn's f1_score(y_true, y_pred) expects y_pred to be of the form
        [0, 1, 1, 1, 1, 0...] and not probabilities.
        This function translates preds into the form sklearn's f1_score
        understands.
        """
        # Only works for multiclass classification
        assert num_class > 2
        # Gather the probability of every class for row i
        preds_for_ith_row = [preds[class_label * num_data + i]
                             for class_label in range(num_class)]
        # The element with the highest probability is predicted
        return np.argmax(preds_for_ith_row)

    def lgb_f1_score(preds, train_data):
        y_true = train_data.get_label()
        num_data = len(y_true)
        num_class = 70  # number of classes in the question's data; adjust for yours
        # Convert the flat probability array into one predicted label per row
        y_pred = [get_ith_pred(preds, i, num_data, num_class)
                  for i in range(num_data)]
        return 'f1', f1_score(y_true, y_pred, average='weighted'), True
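For the top-3 part of the question, the same layout can be reused. Here is a minimal Python sketch (not R), assuming preds arrives as the flat, class-major 1D array described in the docstring, i.e. preds[j * num_data + i] is class j for row i. The names class_major_to_matrix, top_k_accuracy and lgb_top3_accuracy are made up for illustration. As far as I know, LightGBM 4.0 and later pass a 2D (num_data, num_class) array to custom metrics for multiclass tasks, so the sketch returns 2D input unchanged; check this against your installed version.

```python
import numpy as np


def class_major_to_matrix(preds, num_data):
    """Return scores with shape (num_data, num_class).

    Assumes a 1D input is laid out class-major, i.e.
    preds[j * num_data + i] is the score of class j for row i.
    """
    preds = np.asarray(preds)
    if preds.ndim == 2:
        # Assumption: a 2D input is already (num_data, num_class).
        return preds
    return preds.reshape(-1, num_data).T


def top_k_accuracy(y_true, scores, k=3):
    """Fraction of rows whose true label is among the k highest scores."""
    y_true = np.asarray(y_true).astype(int)
    # argpartition leaves the indices of the k largest scores
    # in the last k columns (in no particular order).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())


def lgb_top3_accuracy(preds, train_data):
    """Custom metric returning (name, value, is_higher_better)."""
    y_true = train_data.get_label()
    scores = class_major_to_matrix(preds, len(y_true))
    return 'top3_accuracy', top_k_accuracy(y_true, scores, k=3), True
```

It would be passed the same way as lgb_f1_score, for example through the feval argument of lgb.train. Ties between equal scores are broken arbitrarily by argpartition.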
Upvotes: 1
Reputation:
Implementation for LightGBM's scikit-learn API, where the metric receives y_true and y_pred:
    import numpy as np
    from sklearn.metrics import f1_score

    def lgb_f1_score(y_true, y_pred):
        # One row per class; assumes every class appears in y_true
        preds = y_pred.reshape(len(np.unique(y_true)), -1)
        # Pick the class with the highest probability for each sample
        preds = preds.argmax(axis=0)
        return 'f1', f1_score(y_true, preds, average='weighted'), True
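To check that this reshape agrees with the index arithmetic preds[j * num_data + i], here is a toy comparison using only NumPy. The score matrix is made up, and the sketch assumes the flat class-major layout; a 2D (num_data, num_class) input would need argmax(axis=1) instead.

```python
import numpy as np

num_data, num_class = 4, 3

# Toy score matrix: row i holds the class scores of sample i.
scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.3, 0.5],
                   [0.6, 0.3, 0.1]])

# Flatten class-major: flat[j * num_data + i] is class j of row i.
flat = scores.T.ravel()

# Loop-based lookup, one row at a time.
loop_pred = [
    int(np.argmax([flat[j * num_data + i] for j in range(num_class)]))
    for i in range(num_data)
]

# Vectorized: one row per class, then argmax over the class axis.
vec_pred = flat.reshape(num_class, -1).argmax(axis=0)

print(loop_pred)          # [0, 1, 2, 0]
print(vec_pred.tolist())  # [0, 1, 2, 0]
```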
Upvotes: 4