Reputation: 377
This is for a K-Means algorithm. It is homework, so I do not want to use a built-in k-means function. I have two NumPy arrays: one of centroids and one of data points. I am trying to find the distance from each centroid to each data point, but I don't know how to pass the arrays to my function so that it prints the right values. I want to end up with as many arrays of distances as there are centroids. Then I can compare the distances for each point, pick the smallest, and assign the point to that centroid's cluster. Finally, I take the mean of each cluster, and those means become my new centroids.
import numpy as np
centroids = np.array([[3,44],[5,15]])
dataPoints = np.array([[2,4],[17,4],[45,2],[45,7],[16,32],[32,14],[20,56],[68,33]])
def distance(a,b):
    for x in a: #for each point in centroids array
        for y in b: #for each point in the dataPoints array
            print np.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2) #print the distance
distance(centroids, dataPoints) #call the function with the data
The output I am getting:
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
[ 12.04159458 41.48493703]
What am I doing that is obviously wrong here? I should end up with 2 different arrays with 8 distances each.
Upvotes: 1
Views: 4940
Reputation: 2312
I got tired of writing separate distance calculations for 1D, 2D and 3D arrays, so I cobbled together a function that emulates pdist and cdist from scipy but uses einsum, which many people on this site use. It is easy to follow, in my mind at least, and einsum is versatile for other purposes. So consider the following. You can then use sorting (sort, argsort, etc.) if you need to extract the closest-x values. Hope you find it useful.
import numpy as np
a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[6, 5], [4, 3], [2, 1]])
def e_dist(a, b, metric='euclidean'):
    """Distance calculation for 1D, 2D and 3D points using einsum
    : a, b - list, tuple, array in 1,2 or 3D form
    : metric - euclidean ('e','eu'...), sqeuclidean ('s','sq'...),
    :-----------------------------------------------------------------------
    """
    a = np.asarray(a)
    b = np.atleast_2d(b)
    a_dim = a.ndim
    b_dim = b.ndim
    if a_dim == 1:
        a = a.reshape(1, 1, a.shape[0])
    if a_dim >= 2:
        a = a.reshape(np.prod(a.shape[:-1]), 1, a.shape[-1])
    if b_dim > 2:
        b = b.reshape(np.prod(b.shape[:-1]), b.shape[-1])
    diff = a - b
    dist_arr = np.einsum('ijk,ijk->ij', diff, diff)
    if metric[:1] == 'e':
        dist_arr = np.sqrt(dist_arr)
    dist_arr = np.squeeze(dist_arr)
    return dist_arr
e_dist(a, b)
array([[ 5.8,  3.2,  1.4],
       [ 3.2,  1.4,  3.2],
       [ 1.4,  3.2,  5.8]])
e_dist(a[0], b)
array([ 5.8, 3.2, 1.4])
e_dist(a[:2], b)
array([[ 5.8,  3.2,  1.4],
       [ 3.2,  1.4,  3.2]])
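As a rough sketch of the sorting step applied to the question's data, the distance matrix (one row per centroid, one column per data point) can be fed to argmin to pick the nearest centroid for each point, and the cluster means then give the updated centroids. The names dists, labels and new_centroids are my own, chosen for illustration, and the sketch assumes no cluster ends up empty (an empty cluster would produce NaN means).

```python
import numpy as np

centroids = np.array([[3, 44], [5, 15]])
dataPoints = np.array([[2, 4], [17, 4], [45, 2], [45, 7],
                       [16, 32], [32, 14], [20, 56], [68, 33]])

# Broadcasting: (2, 1, 2) - (1, 8, 2) gives a (2, 8, 2) array of differences
diff = centroids[:, None, :] - dataPoints[None, :, :]

# Sum of squares over the last axis, then sqrt: shape (2, 8)
dists = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff))

# For each data point (column), the index of the closest centroid
labels = np.argmin(dists, axis=0)

# Mean of the points assigned to each centroid (assumes no empty cluster)
new_centroids = np.array([dataPoints[labels == k].mean(axis=0)
                          for k in range(len(centroids))])

print(labels)
print(new_centroids)
```

Repeating the distance, argmin and mean steps until labels stops changing would be the usual k-means loop.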
Upvotes: 2
Reputation: 8131
import numpy as np
centroids = np.array([[3,44],[5,15]])
dataPoints = np.array([[2,4],[17,4],[45,2],[45,7],[16,32],[32,14],[20,56],[68,33]])
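def size(vector):
    # Euclidean norm (length) of a vector
    return np.sqrt(sum(x**2 for x in vector))

def distance(vector1, vector2):
    # Distance between two points is the length of their difference
    return size(vector1 - vector2)

def distances(array1, array2):
    # One row per vector in array1, one column per vector in array2
    return [[distance(vector1, vector2) for vector2 in array2] for vector1 in array1]

print(distances(centroids, dataPoints))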
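As for what went wrong in the question's loop: the expression inside it indexes a and b (the whole arrays) instead of the loop variables x and y, so every one of the 2 x 8 iterations computes the same thing from the first two rows of each array. A minimal corrected sketch is below; it keeps the question's loop structure but collects the values and returns them rather than printing inside the loop. It uses Python 3 print(), unlike the question, and the names row, result and per_centroid are just illustrative.

```python
import numpy as np

centroids = np.array([[3, 44], [5, 15]])
dataPoints = np.array([[2, 4], [17, 4], [45, 2], [45, 7],
                       [16, 32], [32, 14], [20, 56], [68, 33]])

def distance(a, b):
    result = []
    for x in a:          # each centroid
        row = []
        for y in b:      # each data point
            # use the loop variables x and y, not the full arrays a and b
            row.append(np.sqrt((x[0] - y[0])**2 + (x[1] - y[1])**2))
        result.append(np.array(row))
    return result        # one array of distances per centroid

per_centroid = distance(centroids, dataPoints)
for row in per_centroid:
    print(row)
```

This gives 2 arrays of 8 distances each, which is the shape the question asks for.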
Upvotes: 1