OHTO

Reputation: 313

looped sklearn euclidean distances optimisation

I'm looking for smart ways to optimise this looped Euclidean distance calculation. For each vector it computes the mean distance to all other vectors.

Because my vector arrays are far too big to simply do eucl_dist = euclidean_distances(eigen_vs_cleaned), I'm running a loop row by row.

A typical eigen_vs_cleaned shape is at least (300000, 1000) at the moment, and I will have to go much bigger (around (2000000, 10000)). At (300000, 1000) the full float64 pairwise distance matrix alone would be about 300000² × 8 bytes ≈ 720 GB, so the one-shot call is out of the question.

Any smarter way to do this?

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# one mean distance per row
eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0], dtype=float)

for z in range(eigen_vs_cleaned.shape[0]):
    if z % 10000 == 0:
        print(z)  # progress indicator
    # distances from row z to every row: shape (1, n_rows)
    eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[z].reshape(1, -1), eigen_vs_cleaned)
    eucl_dist_meaned[z] = eucl_dist_temp.mean()
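A middle ground between the single euclidean_distances call and the pure row-by-row loop is to process a block of rows per call. The sketch below is illustrative rather than part of the original question; the chunk_size of 1000 is an arbitrary assumption that trades peak memory (chunk_size × n_rows distances per block) against Python-loop overhead.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def mean_distances_chunked(X, chunk_size=1000):
    # Mean distance from each row of X to all rows, computed block by block
    # so that at most chunk_size * n_rows distances are held in memory at once.
    n = X.shape[0]
    out = np.zeros(n, dtype=float)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # (stop - start, n) block of pairwise distances
        block = euclidean_distances(X[start:stop], X)
        out[start:stop] = block.mean(axis=1)
    return out

With 300000 rows and chunk_size=1000, each block is roughly 1000 × 300000 × 8 bytes ≈ 2.4 GB, so a smaller chunk_size may be needed on machines with less RAM.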

Upvotes: 0

Views: 450

Answers (1)

OHTO

Reputation: 313

I'm no Python/NumPy guru, but this is the first step I took in optimising this. It runs much better on my Mac Pro at least.

import os
import tempfile
import shutil
import multiprocessing

import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances

# Create a temporary directory and define the path of the memory-mapped output array
path = tempfile.mkdtemp()
out_path = os.path.join(path, 'out.mmap')

# Writable memmap shared with the worker processes, one mean distance per row
out = np.memmap(out_path, dtype=float, shape=eigen_vs_cleaned.shape[0], mode='w+')

num_cores = multiprocessing.cpu_count()

def runparallel(row, out):
    if row % 10000 == 0:
        print(row)  # progress indicator
    # distances from this row to every row: shape (1, n_rows)
    eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[row].reshape(1, -1), eigen_vs_cleaned)
    out[row] = eucl_dist_temp.mean()

# The workers write straight into the memmap, so the returned list is just Nones
nothing = Parallel(n_jobs=num_cores)(delayed(runparallel)(r, out) for r in range(eigen_vs_cleaned.shape[0]))

Then I save the output:

eucl_dist_meaned = np.array(out, copy=True, dtype=float)
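Since shutil is imported above but never used, presumably the temporary directory is meant to be removed once the result has been copied out of the memmap; a minimal cleanup step (an assumption, not shown in the original answer) would be:

shutil.rmtree(path)  # delete the temporary directory holding out.mmap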

Upvotes: 1
