K Means in Python from Scratch

Question

I have some Python code for a k-means algorithm, and I am having a hard time understanding what it does. Lines like C = X[numpy.random.choice(X.shape[0], k, replace=False), :] are very confusing to me.

Could someone explain what this code is actually doing? Thank you

def k_means(data, k, num_of_features):
    # Make a matrix out of the data
    X = data.as_matrix()
    # Get k random points from the data
    C =  X[numpy.random.choice(X.shape[0], k, replace=False), :]
    # Remove the last col
    C = [C[j][:-1] for j in range(len(C))]
    # Turn it into a numpy array
    C = numpy.asarray(C)
    # To store the value of centroids when it updates
    C_old = numpy.zeros(C.shape)
    # Make an array that will assign clusters to each point
    clusters = numpy.zeros(len(X))
    # Error func. - Distance between new centroids and old centroids
    error = dist(C, C_old, None)
    # Loop will run till the error becomes zero or 5 tries
    tries = 0
    while error != 0 and tries < 1:
        # Assigning each value to its closest cluster
        for i in range(len(X)):
            # Get closest cluster in terms of distance
            clusters[i] = dist1(X[i][:-1], C)
        # Storing the old centroid values
        C_old = deepcopy(C)
        # Finding the new centroids by taking the average value
        for i in range(k):
            # Get all of the points that match the cluster you are on
            points = [X[j][:-1] for j in range(len(X)) if clusters[j] == i]
            # If there were no points assigned to cluster, put at origin
            if not points:
                C[i][:] = numpy.zeros(C[i].shape)
            else:
                # Get the average of all the points and put that centroid there
                C[i] = numpy.mean(points, axis=0)
        # Error is the distance between where the centroids used to be and where they are now
        error = dist(C, C_old, None)
        # Increase tries
        tries += 1
    return sil_coefficient(X,clusters,k)

Ethan Henderson · Accepted Answer

X is the data, as a matrix. The [] notation takes slices, or selects single elements, from that matrix. You may want to review numpy array indexing: https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html

numpy.random.choice(X.shape[0], k, replace=False) draws k distinct integers at random from 0 to X.shape[0] - 1, that is, k row indices chosen without replacement. Notice that the [] index has two entries: the numpy.random.choice call and ":". The ":" means we take everything along that axis.

Thus, X[numpy.random.choice(X.shape[0], k, replace=False), :] selects k indices along the first axis and, for each one, takes every element along the second axis. Effectively, we are selecting k random rows of the matrix, which serve as the initial centroids.
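To see that indexing on something small, here is a minimal sketch. The toy matrix and the fixed seed are my own assumptions, not part of your script:

```python
import numpy

numpy.random.seed(0)                   # fixed seed so the sketch is repeatable
X = numpy.arange(20).reshape(5, 4)     # toy data: 5 rows, 4 columns
k = 2

# k distinct row indices drawn from 0 .. X.shape[0] - 1
row_idx = numpy.random.choice(X.shape[0], k, replace=False)

# Integer-array index on the first axis, ":" on the second -> k whole rows
C = X[row_idx, :]

print(row_idx)    # two different integers between 0 and 4
print(C.shape)    # (2, 4)
```

Each row of C is a complete row of X, picked at one of the indices in row_idx.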

(The comments explain this code quite well. I would suggest you read up on numpy indexing and list comprehensions for further elucidation.)

C = [C[j][:-1] for j in range(len(C))]. The right-hand side is a list comprehension that builds a new list from the rows of the matrix C.

C[j] is the jth row of the matrix C. We use [:-1] to take everything up to, but not including, the final element of the row. We do this for each row in C, which removes the last column of the matrix.

C = numpy.asarray(C). The list comprehension produced a plain Python list of rows; this converts it back to a numpy array so we can do special numpy things with it.
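As a small illustration of those two steps (the values here are made up), the comprehension followed by numpy.asarray gives the same result as a single numpy slice:

```python
import numpy

# Toy centroid matrix; the last column plays the role of the column being dropped
C = numpy.array([[1.0, 2.0, 0.0],
                 [3.0, 4.0, 1.0]])

# What the script does: drop the last entry of each row, then rebuild an array
by_comprehension = numpy.asarray([C[j][:-1] for j in range(len(C))])

# Equivalent slice: all rows, every column except the last
by_slicing = C[:, :-1]

print(numpy.array_equal(by_comprehension, by_slicing))
```

So the two lines in your script could likely be written as C = C[:, :-1], assuming C is a 2-D numpy array at that point.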

C_old = numpy.zeros(C.shape). This creates a zero matrix with the same shape as C. It will hold the previous centroid values once the loop starts updating them.

clusters = numpy.zeros(len(X)). This creates a zero vector with one entry per row of the matrix X. It will be populated later with the cluster assigned to each row.

error = dist(C, C_old, None). Take the distance between the two matrices. I believe this function to be defined elsewhere in your script.
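Since dist is not shown in your post, I can only guess at it. One plausible definition, and this is purely an assumption on my part, is a thin wrapper around numpy.linalg.norm where the third argument is the axis:

```python
import numpy

def dist(a, b, ax=1):
    # Hypothetical stand-in for the dist helper that the question does not show
    return numpy.linalg.norm(a - b, axis=ax)

C = numpy.array([[3.0, 4.0], [0.0, 0.0]])
C_old = numpy.zeros(C.shape)

error = dist(C, C_old, None)   # axis=None: one number for the whole matrix
print(error)                   # 5.0
```

Under that assumption, error is zero exactly when no centroid moved between two passes.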

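tries = 0. Set the tries counter to 0.

while error != 0 and tries < 1: Repeat this block while the condition is true, that is, while the centroids are still moving and the try limit has not been reached. Note that with tries < 1 the body runs at most once, even though the comment mentions 5 tries.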

for i in [0...(number of rows in X - 1)]:

clusters[i] = dist1(X[i][:-1], C). Store in the ith position of clusters the index of the centroid that the ith row of X (minus its last column) is closest to.
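dist1 is also not shown, so the following is only a guess at its behaviour: compute the distance from one point to every centroid and return the index of the smallest.

```python
import numpy

def dist1(point, centroids):
    # Hypothetical stand-in for the dist1 helper that the question does not show
    distances = numpy.linalg.norm(centroids - point, axis=1)  # one distance per centroid
    return int(numpy.argmin(distances))                        # index of the nearest one

C = numpy.array([[5.0, 5.0], [1.0, 0.0]])
print(dist1(numpy.array([0.0, 0.0]), C))   # 1
```

The returned index is what the later test clusters[j] == i compares against.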

C_old = deepcopy(C). Create a new, independent copy of C rather than a second reference to the same array, so that later updates to C do not change C_old.
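A short sketch of why the copy matters (toy values, not from your data):

```python
from copy import deepcopy
import numpy

C = numpy.array([[1.0, 2.0]])

alias = C               # second name for the same array
snapshot = deepcopy(C)  # independent copy of the values

C[0] = [9.0, 9.0]       # update the centroid in place

print(alias[0])         # [9. 9.]  changed along with C
print(snapshot[0])      # [1. 2.]  still the old values
```

Without the copy, C and C_old would always be identical and the error would be zero after the first pass. For a plain numeric array, C.copy() would work as well.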

for i in [0...(number of means - 1)]:

points = [X[j][:-1] for j in range(len(X)) if clusters[j] == i]. This is a list comprehension. Create a list of the rows of X, with all but the last entry, but only include the row if it belongs to the ith cluster.

if not points. If nothing belongs to a cluster.

C[i][:] = numpy.zeros(C[i].shape). Overwrite the ith row of the centroid matrix, C, with zeros, which places that centroid at the origin.

else:

C[i] = numpy.mean(points, axis=0). Assign the ith row of the centroid matrix, C, to be the average point in the cluster. With axis=0 we average down the rows, which gives one mean per column (feature). This is us updating our centroids.
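Here is the update step run on a tiny made-up data set, so you can see both branches. The data, the assignments, and k = 3 are assumptions chosen so that one cluster ends up empty:

```python
import numpy

# Toy data: two feature columns plus a last column that is ignored
X = numpy.array([[0.0, 0.0, 0],
                 [2.0, 4.0, 0],
                 [4.0, 2.0, 0],
                 [10.0, 10.0, 1]])
clusters = numpy.array([0, 0, 0, 1])   # assumed assignments; cluster 2 gets no points
k = 3
C = numpy.ones((k, 2))                 # placeholder centroids, overwritten below

for i in range(k):
    # Rows assigned to cluster i, without their last column
    points = [X[j][:-1] for j in range(len(X)) if clusters[j] == i]
    if not points:
        C[i][:] = numpy.zeros(C[i].shape)   # empty cluster: centroid goes to the origin
    else:
        C[i] = numpy.mean(points, axis=0)   # column-wise average of the points

print(C)
```

Cluster 0 ends up at (2, 2), cluster 1 at (10, 10), and the empty cluster 2 at (0, 0).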
