artemis

Reputation: 7281

Use Numpy to Dynamically Create Arrays

I am trying to use numpy to dynamically create a set of zeros based on the size of a separate numpy array.

This is a small portion of the code from a much larger project; I have posted everything relevant in this question. I have a function k_means which takes in a dataset (sample posted below) and a k value (3 in this example). It creates a variable centroids, which is supposed to look something like

[[4.9 3.1 1.5 0.1]
[7.2 3.  5.8 1.6]
[7.2 3.6 6.1 2.5]]

From there, I need to create a numpy array of "labels" filled with zeroes, with one row for every row in the dataset and the same number of columns as the centroids array. For example, for a dataset with 5 rows, it would look like:
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
This is what I am trying to achieve, albeit dynamically (i.e. where the number of rows and columns in the dataset is not known in advance).

The following (hard-coded, non-numpy) version does that, assuming there are 150 rows in the dataset:

def k_means(dataset, k):
    centroids = [[5,3,2,4.5],[5,3,2,5],[2,2,2,2]]

    cluster_labels = []
    for i in range(0,150):
        cluster_labels.append([0,0,0,0])
    print (cluster_labels)

I am trying to do this dynamically with the following:

def k_means(dataset, k):
    centroids = dataset[numpy.random.choice(dataset.shape[0], k, replace=False), :]
    print(centroids)

    cluster_labels = []
    cluster_labels = numpy.asarray(cluster_labels)
    for index in range(len(dataset)):
        # temp_array = numpy.zeros_like(centroids)
        # print(temp_array)
        cluster_labels = cluster_labels.append(cluster_labels, numpy.zeros_like(centroids))

The current result is: AttributeError: 'numpy.ndarray' object has no attribute 'append'
Or, if I comment out the cluster_labels.append line and uncomment the two temp_array lines, I get:

[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]

That block is printed once per row of the dataset, so I ultimately get 150 copies of it.

Sample of Iris Dataset:

5.1 3.5 1.4 0.2
4.9 3   1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5   3.6 1.4 0.2
5.4 3.9 1.7 0.4
4.6 3.4 1.4 0.3
5   3.4 1.5 0.2
4.4 2.9 1.4 0.2
4.9 3.1 1.5 0.1
5.4 3.7 1.5 0.2
4.8 3.4 1.6 0.2
4.8 3   1.4 0.1
4.3 3   1.1 0.1
5.8 4   1.2 0.2
5.7 4.4 1.5 0.4
5.4 3.9 1.3 0.4
5.1 3.5 1.4 0.3
5.7 3.8 1.7 0.3
5.1 3.8 1.5 0.3
5.4 3.4 1.7 0.2
5.1 3.7 1.5 0.4
4.6 3.6 1   0.2
5.1 3.3 1.7 0.5
4.8 3.4 1.9 0.2
5   3   1.6 0.2
5   3.4 1.6 0.4
5.2 3.5 1.5 0.2
5.2 3.4 1.4 0.2
4.7 3.2 1.6 0.2
4.8 3.1 1.6 0.2
5.4 3.4 1.5 0.4
5.2 4.1 1.5 0.1
5.5 4.2 1.4 0.2

Can anybody help me dynamically use numpy to achieve what I am aiming for?

Thanks.

Upvotes: 0

Views: 1763

Answers (1)

Alperen

Reputation: 4652

The shape attribute of a numpy array is a tuple giving the size of the array along each dimension. For a 2D array, shape is (number of rows, number of columns), so shape[0] is the number of rows and shape[1] is the number of columns. You can use numpy.zeros((dataset.shape[0], centroids.shape[1])) to create a numpy array with the dimensions you want. Here is an example with a modified version of your k_means function:

import numpy

def k_means(dataset, k):
    centroids = dataset[numpy.random.choice(dataset.shape[0], k, replace=False), :]
    print(centroids)

    cluster_labels = numpy.zeros((dataset.shape[0], centroids.shape[1]))
    print(cluster_labels)


dataset = numpy.array([[1,2,3,4,5,6,7,8,9,0],
                       [3,4,5,6,4,3,2,2,6,7],
                       [4,4,5,6,7,7,8,9,9,0],
                       [5,6,7,8,5,3,3,2,2,1],
                       [6,3,3,2,2,4,5,6,6,8]])

k_means(dataset, 2)

Output (the two centroid rows will vary between runs, because they are chosen at random):

[[1 2 3 4 5 6 7 8 9 0]
 [5 6 7 8 5 3 3 2 2 1]]
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
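
To make the shape indexing concrete, here is a small sketch that checks the dimensions directly. It uses a made-up 5x4 array rather than the real Iris data, and the names sample, n_rows, n_cols and labels are only illustrative.

```python
import numpy

# Hypothetical 5-row, 4-column stand-in for the dataset (not the Iris data)
sample = numpy.arange(20, dtype=float).reshape(5, 4)

# Pick 3 distinct rows as centroids, the same way the question does
centroids = sample[numpy.random.choice(sample.shape[0], 3, replace=False), :]

n_rows = sample.shape[0]     # number of rows in the dataset
n_cols = centroids.shape[1]  # number of columns in each centroid

# One row of zeros per dataset row, one column per centroid column
labels = numpy.zeros((n_rows, n_cols))
print(labels.shape)  # prints (5, 4)
```

Nothing here is hard-coded to 5 or 4: both numbers are read from the arrays, so the same lines should work for a dataset of any size.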

I used numpy.zeros((dataset.shape[0], centroids.shape[1])) to stay close to your code. In fact, numpy.zeros(dataset.shape) does the same thing, because centroids.shape[1] and dataset.shape[1] are equal: the centroids are rows chosen from the dataset, so both arrays have the same number of columns. The final version could therefore look like:

def k_means(dataset, k):
    centroids = dataset[numpy.random.choice(dataset.shape[0], k, replace=False), :]
    cluster_labels = numpy.zeros(dataset.shape)
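
As for the AttributeError in the question: numpy arrays, unlike Python lists, have no append method. There is a module-level function numpy.append, but it returns a new array on every call instead of growing the existing one, so using it in a loop copies the data repeatedly. The sketch below compares the two approaches on a hypothetical 150x4 array of ones standing in for the Iris data; the names looped and direct are only illustrative.

```python
import numpy

# Hypothetical stand-in for the 150-row, 4-column dataset
dataset = numpy.ones((150, 4))

# Loop version: numpy.append (a function, not a method) returns a new array
# each time, so the result has to be reassigned on every iteration.
looped = numpy.empty((0, dataset.shape[1]))
for _ in range(dataset.shape[0]):
    row_of_zeros = numpy.zeros((1, dataset.shape[1]))
    looped = numpy.append(looped, row_of_zeros, axis=0)

# Preallocated version: a single call, no loop
direct = numpy.zeros(dataset.shape)

print(looped.shape, direct.shape)         # prints (150, 4) (150, 4)
print(numpy.array_equal(looped, direct))  # prints True
```

Both give the same array, but the single numpy.zeros call is simpler and avoids the repeated copying. numpy.zeros_like(dataset) is another option; note that it keeps the dtype of dataset, whereas numpy.zeros defaults to floats.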

Upvotes: 2
