OHTO

Reputation: 313

looped sklearn euclidean distances optimisation

I'm looking for smart ways to optimise this looped Euclidean distance calculation. For each vector it computes the mean distance to all other vectors.

Because my vector arrays are far too big to simply do eucl_dist = euclidean_distances(eigen_vs_cleaned), I'm running a loop row by row.

A typical eigen_vs_cleaned shape is at least (300000, 1000) at the moment, and I will have to go much bigger (around (2000000, 10000)). At (300000, 1000) the full float64 pairwise distance matrix alone would be about 300000² × 8 bytes ≈ 720 GB, so the one-shot call is out of the question.

Any smarter way to do this?

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# one mean distance per row
eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0], dtype=float)

for z in range(eigen_vs_cleaned.shape[0]):
    if z % 10000 == 0:
        print(z)  # progress indicator
    # distances from row z to every row: shape (1, n_rows)
    eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[z].reshape(1, -1), eigen_vs_cleaned)
    eucl_dist_meaned[z] = eucl_dist_temp.mean()
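A middle ground between the single euclidean_distances call and the pure row-by-row loop is to process a block of rows per call. The sketch below is illustrative rather than part of the original question; the chunk_size of 1000 is an arbitrary assumption that trades peak memory (chunk_size × n_rows distances per block) against Python-loop overhead.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def mean_distances_chunked(X, chunk_size=1000):
    # Mean distance from each row of X to all rows, computed block by block
    # so that at most chunk_size * n_rows distances are held in memory at once.
    n = X.shape[0]
    out = np.zeros(n, dtype=float)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # (stop - start, n) block of pairwise distances
        block = euclidean_distances(X[start:stop], X)
        out[start:stop] = block.mean(axis=1)
    return out

With 300000 rows and chunk_size=1000, each block is roughly 1000 × 300000 × 8 bytes ≈ 2.4 GB, so a smaller chunk_size may be needed on machines with less RAM.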

Upvotes: 0

Views: 450

Answers (1)

OHTO

Reputation: 313

I'm no Python/NumPy guru, but this is the first step I took in optimising this. It runs much better on my Mac Pro at least.

import os
import tempfile
import shutil
import multiprocessing

import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances

# Create a temporary directory and define the path of the memory-mapped output array
path = tempfile.mkdtemp()
out_path = os.path.join(path, 'out.mmap')

# Writable memmap shared with the worker processes, one mean distance per row
out = np.memmap(out_path, dtype=float, shape=eigen_vs_cleaned.shape[0], mode='w+')

num_cores = multiprocessing.cpu_count()

def runparallel(row, out):
    if row % 10000 == 0:
        print(row)  # progress indicator
    # distances from this row to every row: shape (1, n_rows)
    eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[row].reshape(1, -1), eigen_vs_cleaned)
    out[row] = eucl_dist_temp.mean()

# The workers write straight into the memmap, so the returned list is just Nones
nothing = Parallel(n_jobs=num_cores)(delayed(runparallel)(r, out) for r in range(eigen_vs_cleaned.shape[0]))

Then I save the output:

eucl_dist_meaned = np.array(out, copy=True, dtype=float)
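Since shutil is imported above but never used, presumably the temporary directory is meant to be removed once the result has been copied out of the memmap; a minimal cleanup step (an assumption, not shown in the original answer) would be:

shutil.rmtree(path)  # delete the temporary directory holding out.mmap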

Upvotes: 1
