Reputation: 313
I'm looking for smart ways to optimise this looped Euclidean distance calculation. For each vector, it computes the mean distance to all other vectors.
My vector arrays are too big to simply do eucl_dist = euclidean_distances(eigen_vs_cleaned), so I'm running a loop row by row.
A typical eigen_vs_cleaned shape is at least (300000, 1000) at the moment, and I have to go much larger (something like (2000000, 10000)).
Is there a smarter way to do this?
    import numpy as np
    from sklearn.metrics.pairwise import euclidean_distances

    eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0], dtype=float)
    for z in range(eigen_vs_cleaned.shape[0]):
        if z % 10000 == 0:
            print(z)  # progress indicator
        # Distances from row z to every row (itself included), shape (1, n)
        eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[z].reshape(1, -1), eigen_vs_cleaned)
        eucl_dist_meaned[z] = eucl_dist_temp.mean()
Upvotes: 0
Views: 450
Reputation: 313
I'm no Python/NumPy guru, but this is the first step I took to optimise this. It runs much faster on my Mac Pro at least.
    import multiprocessing
    import os
    import shutil
    import tempfile

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.metrics.pairwise import euclidean_distances

    # Create a temporary directory and define the array path
    path = tempfile.mkdtemp()
    out_path = os.path.join(path, 'out.mmap')
    # Memory-mapped output array that every worker process can write into
    out = np.memmap(out_path, dtype=float, shape=eigen_vs_cleaned.shape[0], mode='w+')
    eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0], dtype=float)
    num_cores = multiprocessing.cpu_count()

    def runparallel(row, out):
        if row % 10000 == 0:
            print(row)  # progress indicator
        # Distances from this row to every row (itself included), shape (1, n)
        eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[row].reshape(1, -1), eigen_vs_cleaned)
        out[row] = eucl_dist_temp.mean()

    # The return values are unused; results are written into `out`
    nothing = Parallel(n_jobs=num_cores)(delayed(runparallel)(r, out) for r in range(eigen_vs_cleaned.shape[0]))
Then I save the output:
    eucl_dist_meaned = np.array(out, copy=True, dtype=float)
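A further option is to process a block of rows per step instead of a single row, so that each step is one large matrix product. The following is only a minimal sketch, not benchmarked on data of the sizes above. It uses plain NumPy and the expansion ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2. The function name mean_distances_blocked and the block_size parameter are hypothetical names introduced for illustration. As in the loops above, the zero distance of each row to itself is included in the mean.

```python
import numpy as np

def mean_distances_blocked(X, block_size=256):
    """Mean Euclidean distance from each row of X to all rows of X (self included).

    Hypothetical helper: works on block_size rows at a time so that only a
    (block_size, n) slice of the distance matrix is ever held in memory.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Squared norm of every row, computed once and reused for each block
    sq_norms = np.einsum("ij,ij->i", X, X)
    out = np.empty(n, dtype=float)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        block = X[start:stop]
        # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2 for every (block row, row) pair
        d2 = sq_norms[start:stop, None] - 2.0 * (block @ X.T) + sq_norms[None, :]
        # Rounding can leave tiny negative values; clip them before the sqrt
        np.maximum(d2, 0.0, out=d2)
        out[start:stop] = np.sqrt(d2).mean(axis=1)
    return out

# Small usage example on random data (the real array would be eigen_vs_cleaned)
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 8))
demo_means = mean_distances_blocked(demo, block_size=16)
```

The block size trades memory for speed: with 300000 rows and float64, a block of 256 rows needs roughly 256 * 300000 * 8 bytes, about 0.6 GB, for its slice of the distance matrix, so block_size would have to be tuned to the available RAM. The blocks are independent, so in principle they could also be handed to Parallel in place of single rows.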
Upvotes: 1