Sam V

Reputation: 619

How to calculate Gini Index using two numpy arrays

For a machine learning class I need to calculate the Gini index for a decision tree split with 2 classes (0 and 1 in this case). I have read multiple sources on how to calculate it, but I cannot get it working in my own script. Having tried about 10 different calculations, I am getting kind of desperate.

The arrays are:

Y_left = np.array([[1.],[0.],[0.],[1.],[1.],[1.],[1.]])
Y_right = np.array([[1.],[0.],[0.],[0.],[1.],[0.],[0.],[1.],[0.]])

And the output should be 0.42857.

Formula

Here C is the set of class labels (two in this case), and S_L and S_R are the two splits determined by the splitting criterion.
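
Assuming this is the usual sample-weighted Gini impurity, with p_c taken to be the fraction of samples in S_n that carry label c (that reading of p_c is my assumption), the formula can be written as:

```latex
\mathrm{Gini}(S_L, S_R) = \sum_{n \in \{L, R\}} \frac{|S_n|}{|S|} \left( 1 - \sum_{c \in C} p_c^{2} \right),
\qquad |S| = |S_L| + |S_R|
```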

What I have right now:

def tree_gini_index(Y_left, Y_right, classes):
    """Compute the Gini Index.
    # Arguments
        Y_left: class labels of the data left set
            np.array of size `(n_objects, 1)`
        Y_right: class labels of the data right set
            np.array of size `(n_objects, 1)`
        classes: list of all class values
    # Output
        gini: scalar `float`
    """
    gini = 0.0
    total = len(Y_left) + len(Y_right)
    gini = sum((sum(Y_left) / total)**2, (sum(Y_right) / total)**2)
    return gini

If anyone could give me any directions on how to define this function I would be very grateful.

Upvotes: 1

Views: 3282

Answers (1)

swag2198

Reputation: 2696

The function below computes the Gini impurity of a single label array (either the left or the right one). The probs list stores the class probabilities p_c from your formula.

import numpy as np

def gini(y, classes):
    """Gini impurity of a single array of class labels."""
    y = y.reshape(-1)  # flatten the (n, 1) column into a 1D array
    if y.shape[0] == 0:
        return 0.0  # an empty split contributes no impurity

    # Fraction of samples belonging to each class c in classes
    probs = [(y == cls).sum() / y.shape[0] for cls in classes]

    p = np.array(probs)
    return 1 - (p * p).sum()
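
As a hand check on the arrays from the question (worked out by counting labels, not by running the code): Y_left has 5 ones and 2 zeros, and Y_right has 3 ones and 6 zeros, so the per-array values should come out as:

```latex
G(Y_{\text{left}})  = 1 - \left(\tfrac{2}{7}\right)^2 - \left(\tfrac{5}{7}\right)^2 = \tfrac{20}{49} \approx 0.4082
\\
G(Y_{\text{right}}) = 1 - \left(\tfrac{6}{9}\right)^2 - \left(\tfrac{3}{9}\right)^2 = \tfrac{4}{9} \approx 0.4444
```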

The next function computes the average of the two impurities, weighted by the number of samples on each side, which gives the final Gini index for the split. Here p_L and p_R play the role of |S_n|/|S| in your formula, with n being either left or right.

def tree_gini_index(Y_left, Y_right, classes):
    
    N = Y_left.shape[0] + Y_right.shape[0]
    p_L = Y_left.shape[0] / N
    p_R = Y_right.shape[0] / N
    
    return p_L * gini(Y_left, classes) + p_R * gini(Y_right, classes)
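
For the arrays in the question this weighting can be checked by hand. There are 7 left and 9 right samples, and counting labels (5 ones and 2 zeros on the left, 3 ones and 6 zeros on the right) gives per-side impurities of 20/49 and 4/9, so:

```latex
\tfrac{7}{16} \cdot \tfrac{20}{49} + \tfrac{9}{16} \cdot \tfrac{4}{9}
  = \tfrac{5}{28} + \tfrac{1}{4}
  = \tfrac{3}{7} \approx 0.42857
```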

Call it as:

Y_left = np.array([[1.],[0.],[0.],[1.],[1.],[1.],[1.]])
Y_right = np.array([[1.],[0.],[0.],[0.],[1.],[0.],[0.],[1.],[0.]])
tree_gini_index(Y_left, Y_right, [0, 1])

Output:

0.4285714285714286
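
As for why the attempt in the question gives a different number, my reading of the snippet is that two things go wrong. The built-in sum treats its second argument as a start value rather than a second term, and the expression itself only squares the class-1 counts divided by the overall total, without the "1 - ..." part or the per-split weighting. A small sketch that reproduces this (the variable names builtin_example and attempt are mine, for illustration only):

```python
import numpy as np

Y_left = np.array([[1.], [0.], [0.], [1.], [1.], [1.], [1.]])
Y_right = np.array([[1.], [0.], [0.], [0.], [1.], [0.], [0.], [1.], [0.]])
total = len(Y_left) + len(Y_right)

# Built-in sum(iterable, start): the second argument is a start value,
# so this evaluates 10 + 1 + 2.
builtin_example = sum([1, 2], 10)

# The expression from the question: (ones_left / total)^2 plus
# (ones_right / total)^2, returned as a length-1 array.
attempt = sum((sum(Y_left) / total) ** 2, (sum(Y_right) / total) ** 2)

print(builtin_example)
print(attempt)
```

The second value is (5/16)^2 + (3/16)^2, roughly 0.133, which is not an impurity at all. That is why the class probabilities have to be computed within each split first and only then weighted by split size.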

Upvotes: 1
