Areza

Reputation: 6080

Customized metric function for multi-class classification in LightGBM

In my data there are about 70 classes, and I am using LightGBM to predict the correct class label.

In R, I would like to have a customised "metric" function where I can evaluate whether the top 3 predictions by LightGBM cover the true label.

The example linked here is a helpful starting point:

import numpy as np
from sklearn.metrics import f1_score

def lgb_f1_score(y_hat, data):
    y_true = data.get_label()
    y_hat = np.round(y_hat)  # scikit-learn's f1_score doesn't accept probabilities
    return 'f1', f1_score(y_true, y_hat), True

However, I don't know the dimensionality of the arguments passed to this function; the data seem to be shuffled for some reason.
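For reference, a top-3 check of the kind described could be sketched like this (my own Python sketch, not from any answer; the name top3_accuracy is made up, and it assumes LightGBM's class-major flat layout where preds[class_label * num_data + i] is the score of class class_label for row i):

```python
def top3_accuracy(preds, num_data, num_class, labels):
    """Fraction of rows whose true label is among the 3 highest-scored classes.

    preds is a flat list in class-major order: preds[c * num_data + i]
    is the score of class c for row i (LightGBM's multiclass layout).
    """
    hits = 0
    for i in range(num_data):
        row_scores = [preds[c * num_data + i] for c in range(num_class)]
        # Indices of the 3 highest-scored classes for row i
        top3 = sorted(range(num_class), key=lambda c: row_scores[c], reverse=True)[:3]
        if labels[i] in top3:
            hits += 1
    return hits / num_data

# Tiny example: 2 rows, 4 classes, scores laid out class-major.
preds = [0.1, 0.4,   # class 0 scores for rows 0, 1
         0.2, 0.3,   # class 1
         0.3, 0.2,   # class 2
         0.4, 0.1]   # class 3
print(top3_accuracy(preds, num_data=2, num_class=4, labels=[3, 3]))  # -> 0.5
```

Row 0's top 3 classes are {3, 2, 1} (hit for label 3); row 1's are {0, 1, 2} (miss), hence 0.5.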

Upvotes: 4

Views: 2673

Answers (2)

codeananda

Reputation: 1310

After reading through the docs for lgb.train and lgb.cv, I had to make a separate function get_ith_pred and then call that repeatedly within lgb_f1_score.

The function's docstring explains how it works. I have used the same argument names as in the LightGBM docs. This can work for any number of classes but does not work for binary classification. In the binary case, preds is a 1D array containing the probability of the positive class.

import numpy as np
from sklearn.metrics import f1_score

def get_ith_pred(preds, i, num_data, num_class):
    """
    preds: 1D NumPy array
        A 1D numpy array containing predicted probabilities. Has shape
        (num_data * num_class,). So, for binary classification with
        100 rows of data in your training set, preds has shape (200,),
        i.e. (100 * 2,).
    i: int
        The row/sample in your training data you wish to calculate
        the prediction for.
    num_data: int
        The number of rows/samples in your training data
    num_class: int
        The number of classes in your classification task.
        Must be greater than 2.
    
    
    LightGBM docs tell us that to get the probability of class 0 for 
    the 5th row of the dataset we do preds[0 * num_data + 5].
    For class 1 prediction of 7th row, do preds[1 * num_data + 7].
    
    sklearn's f1_score(y_true, y_pred) expects y_pred to be of the form
    [0, 1, 1, 1, 1, 0...] and not probabilities.
    
    This function translates preds into the form sklearn's f1_score 
    understands.
    """
    # Only works for multiclass classification
    assert num_class > 2

    preds_for_ith_row = [preds[class_label * num_data + i]
                         for class_label in range(num_class)]

    # The element with the highest probability is the prediction
    return np.argmax(preds_for_ith_row)


def lgb_f1_score(preds, train_data):
    y_true = train_data.get_label()

    num_data = len(y_true)
    num_class = 70
    
    y_pred = []
    for i in range(num_data):
        ith_pred = get_ith_pred(preds, i, num_data, num_class)
        y_pred.append(ith_pred)
    
    return 'f1', f1_score(y_true, y_pred, average='weighted'), True
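To see how the decoding above behaves end to end, here is a dependency-free walk-through (my own sketch: decode_row is a pure-Python stand-in for get_ith_pred, using the same class-major indexing rule from the docstring):

```python
def decode_row(preds, i, num_data, num_class):
    # Pure-Python stand-in for get_ith_pred: collect row i's score for
    # each class from the class-major flat layout, then pick the argmax.
    row_scores = [preds[c * num_data + i] for c in range(num_class)]
    return max(range(num_class), key=row_scores.__getitem__)

# 3 rows, 3 classes; within each class block, position i is row i's score.
preds = [0.7, 0.1, 0.2,   # class 0 scores for rows 0..2
         0.2, 0.8, 0.3,   # class 1
         0.1, 0.1, 0.5]   # class 2
y_pred = [decode_row(preds, i, num_data=3, num_class=3) for i in range(3)]
print(y_pred)  # -> [0, 1, 2]
```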

Upvotes: 1

user6776983

Reputation:

Scikit-learn implementation

import numpy as np
from sklearn.metrics import f1_score

def lgb_f1_score(y_true, y_pred):
    preds = y_pred.reshape(len(np.unique(y_true)), -1)
    preds = preds.argmax(axis=0)
    print(preds.shape)
    print(y_true.shape)
    return 'f1', f1_score(y_true, preds, average='weighted'), True
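One caveat worth noting (my reading, not from the answer): len(np.unique(y_true)) only equals the true class count when every class actually occurs in that data split, so with 70 classes it may be safer to hard-code num_class. A small walk-through of the reshape trick, assuming the same class-major layout:

```python
import numpy as np

num_class, num_data = 3, 2
# Flat class-major scores: element c * num_data + i is class c's score for row i.
y_pred = np.array([0.6, 0.2,   # class 0 scores for rows 0, 1
                   0.3, 0.7,   # class 1
                   0.1, 0.1])  # class 2
# reshape(num_class, -1) puts one class per row and one sample per column,
# so argmax over axis 0 picks the best class for each sample.
preds = y_pred.reshape(num_class, -1).argmax(axis=0)
print(preds)  # -> [0 1]
```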

Upvotes: 4
