Reputation: 219
I'm trying to run a K-Means Clustering algorithm with numpy and python, but keep running into a memory error if I use larger values of K (anything greater than 10 seems to result in the error). I have two numpy arrays of size [42000,784] (the dataset) and [K,784] (the centroids). The memory error occurs when calculating the euclidean distance between each of the centroids and each of the data points. This is the function I have been using:
import numpy as np

def dist(a, b):
    # broadcast to shape (K, 42000, 784), then take the norm over the feature axis
    a = a[np.newaxis, :, :]
    b = b[:, np.newaxis, :]
    dist = np.linalg.norm(a - b, axis=2)
    return dist
Is this a memory leak or do I legitimately not have enough memory (I have 8GB)? How can I fix this?
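For reference, here is a rough back-of-the-envelope estimate of the temporary array that the broadcasted subtraction creates (assuming float64, 8 bytes per value):
# rough size of the (K, 42000, 784) intermediate produced by a - b
n_points, n_features = 42000, 784
for k in (10, 20, 30):
    gib = k * n_points * n_features * 8 / 1024**3
    print(f"K={k}: about {gib:.1f} GiB just for the broadcasted difference")
On top of that, np.linalg.norm builds further temporaries of similar size, so K values not far above 10 can plausibly exhaust 8 GB.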
Upvotes: 2
Views: 2298
Reputation: 2816
As an alternative, using the numba accelerator in nopython parallel mode can be one of the fastest methods. I have compared the performance of numba and cdist on various array sizes; both take about the same time (e.g. around 8 seconds on my machine), and numba may beat cdist on smaller arrays:
import numpy as np
import numba as nb

@nb.njit("float64[:, ::1](float64[:, ::1], float64[:, ::1])", parallel=True)
def distances_numba(a, b):
    # pairwise euclidean distances, one row of the result per parallel iteration
    arr = np.zeros((a.shape[0], b.shape[0]), dtype=np.float64)
    temp_arr = np.zeros_like(arr)
    for i in nb.prange(a.shape[0]):
        for j in range(b.shape[0]):
            for k in range(a.shape[1]):
                temp_arr[i, j] += (a[i, k] - b[j, k]) ** 2
            arr[i, j] = temp_arr[i, j] ** 0.5
    return arr
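A quick usage sketch on arrays of the sizes mentioned in the question (random stand-ins here; the typed signature expects C-contiguous float64 inputs):
import numpy as np

a = np.random.rand(42000, 784)   # stand-in for the dataset
b = np.random.rand(256, 784)     # stand-in for the centroids
d = distances_numba(a, b)        # shape (42000, 256)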
I did not compare memory consumption, but I think numba will be among the best in this regard.
We can also parallelize the max9111 answer, which was 10-20 times faster than cdist or my first solution. The parallelization makes the max9111 solution another 1.5 to 2 times faster. These benchmarks are based on a few of my own tests and need more evaluation.
@nb.njit("float64[::1](float64[:, ::1])", parallel=True)
def first(A):
TMP_A = np.zeros(A.shape[0], dtype=np.float64)
for i in nb.prange(A.shape[0]):
for j in range(A.shape[1]):
TMP_A[i] += A[i, j] ** 2
return TMP_A
@nb.njit("float64[::1](float64[:, ::1])", parallel=True)
def second(B):
TMP_B = np.zeros(B.shape[0], dtype=np.float64)
for i in nb.prange(B.shape[0]):
for j in range(B.shape[1]):
TMP_B[i] += B[i, j] ** 2
return TMP_B
@nb.njit("float64[:, ::1](float64[:, ::1], float64[:, ::1])", parallel=True)
def calc_dist_p(A, B):
dist = np.dot(A, B.T)
TMP_A = first(A)
TMP_B = second(B)
for i in nb.prange(A.shape[0]):
for j in range(B.shape[0]):
dist[i, j] = (-2. * dist[i, j] + TMP_A[i] + TMP_B[j]) ** 0.5
return dist
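The trick here is the expansion ||a - b||^2 = ||a||^2 - 2*a·b + ||b||^2: the pairwise dot products come from a single matrix product and the squared row norms from the two helper functions, so no (K, 42000, 784) temporary is ever built. A quick sanity check against cdist on smaller random arrays (for illustration only):
import numpy as np
from scipy.spatial import distance

a = np.random.rand(2000, 784)    # smaller stand-in for the dataset
b = np.random.rand(64, 784)      # stand-in for the centroids
print(np.allclose(calc_dist_p(a, b), distance.cdist(a, b)))  # expect True up to rounding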
Upvotes: 1
Reputation: 5486
scipy has built-in functions for distance computations, which are lightning fast compared to homemade implementations.
So, the first idea is to replace your whole distance function with the following expression:
from numpy.random import randn
from scipy.spatial import distance

# sample data
a = randn(42000, 784)
b = randn(256, 784)

# distance computation
dist = distance.cdist(a, b, metric='euclidean')  # about 8.02 s on my 8 GB RAM machine
Note that dist in this example is transposed compared to your example. If you want the shape from your example, just do dist = distance.cdist(a, b).T.
It is further possible to speed up the computation a little by omitting the square root operation. You can accomplish this with dist = distance.cdist(a, b, metric='sqeuclidean').
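For the assignment step of K-Means the squared distance is all you need, because the nearest centroid under the squared distance is also the nearest under the true distance. A minimal sketch, reusing a, b and the distance import from above:
import numpy as np

d2 = distance.cdist(a, b, metric='sqeuclidean')  # shape (42000, 256)
labels = np.argmin(d2, axis=1)                   # index of the closest centroid per point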
This whole approach does not greatly reduce memory consumption, but it only occupies the memory for a few seconds.
The second idea is to not use homemade implementations at all, but a reliable third-party package, like the well-known Scikit-Learn:
from numpy.random import randn
from sklearn.cluster import KMeans

a = randn(4200, 200)
km = KMeans(n_clusters=256)
km.fit(a)  # about 10 s
One of several advantages of this implementation is that it automatically decides how to compute the distances so that it does not blow up your memory.
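If you go this route, the fitted estimator already exposes the pieces a manual implementation would have to compute:
labels = km.labels_               # cluster index for each sample
centroids = km.cluster_centers_   # array of shape (n_clusters, n_features)
inertia = km.inertia_             # sum of squared distances to the closest centroid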
Upvotes: 2