Sundios

Reputation: 438

How to perform the elbow method in Python?

I want to find the optimal number of clusters k using the elbow method. I'm not using the scikit-learn library. I have coded k-means from scratch, and now I'm having a difficult time figuring out how to code the elbow method in Python. I'm a total beginner.

This is my k-means code:


import numpy as np


def cluster_init(array, k):

    initial_assgnm = np.append(np.arange(k), np.random.randint(0, k, size=(len(array))))[:len(array)]
    np.random.shuffle(initial_assgnm)
    zero_arr = np.zeros((len(initial_assgnm), 1))

    for indx, cluster_assgnm in enumerate(initial_assgnm):
        zero_arr[indx] = cluster_assgnm
    upd_array = np.append(array, zero_arr, axis=1)

    return upd_array


def kmeans(array, k):

    cluster_array = cluster_init(array, k)


    while True:
        unique_clusters = np.unique(cluster_array[:, -1])

        centroid_dictonary = {}
        for cluster in unique_clusters:
            centroid_dictonary[cluster] = np.mean(cluster_array[np.where(cluster_array[:, -1] == cluster)][:, :-1], axis=0)


        start_array = np.copy(cluster_array)


        for row in range(len(cluster_array)):
            cluster_array[row, -1] = unique_clusters[np.argmin(
                [np.linalg.norm(cluster_array[row, :-1] - centroid_dictonary.get(cluster)) for cluster in unique_clusters])]

        if np.array_equal(cluster_array, start_array):
            break

    return centroid_dictonary

This is what I have tried for the elbow method:

cost = []
K= range(1,239)
for k in K :
    KM = kmeans(x,k)
    print(k)
    KM.fit(x)
    cost.append(KM.inertia_)

But I get the following error:

KM.fit(x)

AttributeError: 'dict' object has no attribute 'fit'

Upvotes: 1

Views: 813

Answers (1)

Juan Carlos Ramirez

Reputation: 2129

If you want to compute the elbow values from scratch, you need to compute the inertia for the current clustering assignment. To do this, sum the per-point inertias. The inertia of a single data point is the squared Euclidean distance from the point to its closest center. If you have a function that computes this for you (in scikit-learn, this corresponds to pairwise_distances_argmin_min), you could do:

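from sklearn.metrics import pairwise_distances_argmin_min

# X: data points, shape (n, d); centers: cluster centers, shape (k, d)
labels, mindist = pairwise_distances_argmin_min(
    X=X, Y=centers, metric='euclidean', metric_kwargs={'squared': True})
inertia = mindist.sum()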

If you want to write this function yourself, loop over every row x in X and find the minimum over all y in Y of dist(x, y), where dist is the squared Euclidean distance; that minimum is the inertia for x. This naive approach needs O(nk) distance computations for n points and k centers, which is slow in a pure Python loop, so you might consider using the library function instead.
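As a rough sketch of the from-scratch route, the per-point minimum can be computed with NumPy broadcasting instead of an explicit loop. The function names compute_inertia and elbow_costs are made up for illustration, and elbow_costs assumes that the k-means function returns a dict mapping cluster label to centroid, as the kmeans function in the question does. That return value is also why KM.fit(x) fails: a dict has no fit method and no inertia_ attribute, so the inertia has to be computed from the centroids directly.

```python
import numpy as np


def compute_inertia(X, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise differences via broadcasting: shape (n_points, n_centers, n_features).
    diffs = X[:, np.newaxis, :] - centers[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Each point contributes its squared distance to the closest center.
    return sq_dists.min(axis=1).sum()


def elbow_costs(X, k_values, kmeans_fn):
    """Run kmeans_fn(X, k) for each k and collect the resulting inertia.

    Assumption: kmeans_fn returns a dict {cluster_label: centroid},
    like the kmeans function shown in the question.
    """
    costs = []
    for k in k_values:
        centroids = kmeans_fn(X, k)
        centers = np.array(list(centroids.values()))
        costs.append(compute_inertia(X, centers))
    return costs
```

With the question's code this would be called as cost = elbow_costs(x, range(1, 239), kmeans), and plotting cost against k should show the elbow. Note that the broadcasting step allocates an array of n * k * d floats, so for very large inputs a chunked loop or the library function may be the better choice.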

Upvotes: 1
