Reputation: 125
I need a memory & time efficient method to compute distances between about 50000 points in 1- to 10-dimensions, in Python. The methods I tried so far were not very good; so far, I tried:
scipy.spatial.distance.pdist
computes the full distance matrixscipy.spatial.KDTree.sparse_distance_matrix
computes the sparse distance matrix up to a thresholdTo my surprise, the sparse_distance_matrix
was badly underperforming. The example I used was 5000 points chosen uniformly from the unit 5-dimensional ball, where pdist
returned me the result in 0.113 seconds and the sparse_distance_matrix
returned me the result in 44.966 seconds, when I made it use the threshold 0.1 for the maximum distance cutoff.
At this point, I would just stick with pdist
, but with 50000 points, it will be using a numpy array of 2.5 x 10^9 entries, and I'm concerned if it will overload the runtime (?) memory.
Does anyone know a better method, or sees a glaring mistake in my implementations? Thanks in advance!
Here's what's needed to reproduce the output in Python3:
import numpy as np
import math
import time
from scipy.spatial.distance import pdist
from scipy.spatial import KDTree as kdtree
# Generate a uniform sample of size N on the unit dim-dimensional sphere (which lives in dim+1 dimensions)
def sphere(N, dim):
# Get a random sample of points from the (dim+1)-dim. Gaussian.
output = np.random.multivariate_normal(mean=np.zeros(dim+1), cov=np.identity(dim+1), size=N)
# Normalize output
output = output / np.linalg.norm(output, axis=1).reshape(-1,1)
return output
# Generate a uniform sample of size N on the unit dim-dimensional ball.
def ball(N, dim):
# Populate the points on the unit sphere that is the boundary.
sphere_points = sphere(N, dim-1)
# Randomize radii of the points on the sphere using power law to get a uniform distribution on the ball.
radii = np.power(np.random.random(N), 1/dim)
output = radii.reshape(-1, 1) * sphere_points
return output
N = 5000
dim = 5
r_cutoff = 0.1
# Generate a sample to test
sample = ball(N, dim)
# Construct a KD Tree for the sample
sample_kdt = kdtree(sample)
# pdist method for distance matrix
tic = time.monotonic()
pdist(sample)
toc = time.monotonic()
print(f"Time taken from pdist = {toc-tic}")
# KD Tree method for distance matrix
tic = time.monotonic()
sample_kdt.sparse_distance_matrix(sample_kdt, r_cutoff)
toc = time.monotonic()
print(f"Time taken from the KDTree method = {toc-tic}")
Upvotes: 1
Views: 1017
Reputation: 1487
import numpy as np
from sklearn.neighbors import BallTree
tic = time.monotonic()
tree = BallTree(sample, leaf_size=10)
d,i = tree.query(sample, k=1)
toc = time.monotonic()
print(f"Time taken from Sklearn BallTree = {toc-tic}")
This one did Time taken from Sklearn BallTree = 0.30803330009803176
on my machine. The pdist
did little over a second. Note: I am doing some heavy calculations which I have 3/4 cores on my machine.
That one takes the closest k=1
For radius 0.1
import numpy as np
from sklearn.neighbors import BallTree
tic = time.monotonic()
tree = BallTree(sample, leaf_size=10)
i = tree.query_radius(sample, r=0.1)
toc = time.monotonic()
print(f"Time taken from Sklearn BallTree Radius = {toc-tic}")
with speed
Time taken from Sklearn BallTree Radius = 0.11115029989741743
Upvotes: 3