LunarLlama

Reputation: 219

Memory Error Numpy/Python Euclidean Distance

I'm trying to run a K-Means clustering algorithm with numpy and python, but I keep running into a memory error for larger values of K (anything greater than 10 seems to trigger it). I have two numpy arrays of size [42000, 784] (the dataset) and [K, 784] (the centroids). The memory error occurs when calculating the Euclidean distance between each of the centroids and each of the data points. This is the function I have been using:

import numpy as np

def dist(a, b):
    # broadcast to pairwise differences: (1, N, D) - (K, 1, D) -> (K, N, D)
    a = a[np.newaxis, :, :]
    b = b[:, np.newaxis, :]
    dist = np.linalg.norm(a - b, axis=2)  # (K, N) matrix of distances
    return dist
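
If I understand broadcasting correctly, a - b materializes a full [K, 42000, 784] float64 intermediate before the norm is taken, so a rough back-of-the-envelope estimate of that intermediate alone is:

K = 10
intermediate_bytes = K * 42000 * 784 * 8   # float64 = 8 bytes per element
print(intermediate_bytes / 1e9)            # ~2.6 GB for K = 10, growing linearly with K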

Is this a memory leak or do I legitimately not have enough memory (I have 8GB)? How can I fix this?

Upvotes: 2

Views: 2298

Answers (2)

Ali_Sh

Reputation: 2816

As an alternative, using the numba accelerator in nopython parallel mode can be one of the fastest approaches. I compared the performance of numba and cdist on various array sizes; both take about the same time (e.g. around 8 seconds each on my machine), though numba may beat cdist on smaller arrays:

import numpy as np
import numba as nb

@nb.njit("float64[:, ::1](float64[:, ::1], float64[:, ::1])", parallel=True)
def distances_numba(a, b):
    # pairwise Euclidean distances; the outer loop is parallelized with prange
    arr = np.zeros((a.shape[0], b.shape[0]), dtype=np.float64)
    temp_arr = np.zeros_like(arr)

    for i in nb.prange(a.shape[0]):
        for j in range(b.shape[0]):
            for k in range(a.shape[1]):
                temp_arr[i, j] += (a[i, k] - b[j, k]) ** 2
            arr[i, j] = temp_arr[i, j] ** 0.5

    return arr

I did not compare memory consumption, but I expect numba to be among the best in this regard, since the loops only allocate two output-sized arrays rather than a huge broadcast intermediate.
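
For reference, a minimal usage sketch with the question's shapes (my example, not from the benchmark; the signature above requires C-contiguous float64 inputs, which np.random.rand already produces):

a = np.random.rand(42000, 784)   # dataset
b = np.random.rand(10, 784)      # centroids, K = 10
d = distances_numba(a, b)        # shape (42000, 10)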

Update

We can parallelize the max9111 answer, which was 10-20 times faster than cdist or my first solution. It relies on the expansion ||a - b||^2 = ||a||^2 - 2*a.b + ||b||^2, so the pairwise distances reduce to one matrix product plus per-row squared norms. The parallelization makes the max9111 solution another 1.5 to 2 times faster. The benchmarks are based on some of my tests and need more evaluation.

@nb.njit("float64[::1](float64[:, ::1])", parallel=True)
def first(A):
    TMP_A = np.zeros(A.shape[0], dtype=np.float64)
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            TMP_A[i] += A[i, j] ** 2
    return TMP_A


@nb.njit("float64[::1](float64[:, ::1])", parallel=True)
def second(B):
    TMP_B = np.zeros(B.shape[0], dtype=np.float64)
    for i in nb.prange(B.shape[0]):
        for j in range(B.shape[1]):
            TMP_B[i] += B[i, j] ** 2
    return TMP_B


@nb.njit("float64[:, ::1](float64[:, ::1], float64[:, ::1])", parallel=True)
def calc_dist_p(A, B):
    dist = np.dot(A, B.T)
    TMP_A = first(A)
    TMP_B = second(B)
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            dist[i, j] = (-2. * dist[i, j] + TMP_A[i] + TMP_B[j]) ** 0.5
    return dist
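
One caveat, which is my own addition rather than part of the benchmarks: floating-point cancellation in -2*a.b + ||a||^2 + ||b||^2 can leave a tiny negative value under the square root when a point almost coincides with a centroid, producing NaN. A clamped variant (the name calc_dist_p_safe is illustrative) guards against this:

@nb.njit("float64[:, ::1](float64[:, ::1], float64[:, ::1])", parallel=True)
def calc_dist_p_safe(A, B):
    dist = np.dot(A, B.T)
    TMP_A = first(A)
    TMP_B = second(B)
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            # clamp tiny negatives caused by rounding before the square root
            d2 = -2. * dist[i, j] + TMP_A[i] + TMP_B[j]
            dist[i, j] = max(d2, 0.) ** 0.5
    return dist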

Upvotes: 1

MaxPowers

Reputation: 5486

scipy has built-in functions for distance computations, which are lightning fast compared to home-made implementations.

So the first idea is to replace your whole distance function with the following expression:

from numpy.random import rand
from scipy.spatial import distance

# sample data
a = rand(42000, 784)
b = rand(256, 784)

# distance computation
dist = distance.cdist(a, b, metric='euclidean')    # about 8.02 s on
                                                   # my 8 GB RAM machine

Note that dist in this example is transposed relative to your example. If you want the shape from your example, just use dist = distance.cdist(a, b).T.

It is further possible to speed up the computation a little by omitting the square root operation. You can accomplish this with dist = distance.cdist(a, b, metric='sqeuclidean'). Since the square root is monotonic, it does not change which centroid is nearest to each point.
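
Since the ordering is preserved, the K-Means assignment step can work on the squared distances directly; a short sketch reusing a and b from above (labels is an illustrative name):

d2 = distance.cdist(a, b, metric='sqeuclidean')   # shape (42000, 256)
labels = d2.argmin(axis=1)                        # index of the nearest centroid per point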

This whole approach does not greatly reduce memory consumption, but it holds the memory for only a few seconds.

The second idea is to not use home-made implementations at all, but a reliable third-party package, like the well-known scikit-learn:

from sklearn.cluster import KMeans

a = rand(4200, 200)

km = KMeans(n_clusters=256)
km.fit(a)                    # about 10 s

One of several advantages of this implementation is that it automatically decides how to compute the distances so that it does not blow up your memory.
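
The cluster assignments and centroids then come from the standard scikit-learn attributes:

labels = km.labels_                  # cluster index per sample, shape (4200,)
centroids = km.cluster_centers_      # array of shape (256, 200)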

Upvotes: 2
