Uzu Lim
Uzu Lim

Reputation: 125

Efficient nearby distance matrix in Python

I need a memory & time efficient method to compute distances between about 50000 points in 1- to 10-dimensions, in Python. The methods I tried so far were not very good; so far, I tried:

To my surprise, the sparse_distance_matrix was badly underperforming. The example I used was 5000 points chosen uniformly from the unit 5-dimensional ball, where pdist returned me the result in 0.113 seconds and the sparse_distance_matrix returned me the result in 44.966 seconds, when I made it use the threshold 0.1 for the maximum distance cutoff.

At this point, I would just stick with pdist, but with 50000 points, it will be using a numpy array of 2.5 x 10^9 entries, and I'm concerned if it will overload the runtime (?) memory.

Does anyone know a better method, or sees a glaring mistake in my implementations? Thanks in advance!


Here's what's needed to reproduce the output in Python3:

import numpy as np
import math
import time
from scipy.spatial.distance import pdist
from scipy.spatial import KDTree as kdtree

# Generate a uniform sample of size N on the unit dim-dimensional sphere (which lives in dim+1 dimensions)
def sphere(N, dim):
    # Get a random sample of points from the (dim+1)-dim. Gaussian.
    output = np.random.multivariate_normal(mean=np.zeros(dim+1), cov=np.identity(dim+1), size=N)
    # Normalize output
    output = output / np.linalg.norm(output, axis=1).reshape(-1,1)
    return output

# Generate a uniform sample of size N on the unit dim-dimensional ball.
def ball(N, dim):
    # Populate the points on the unit sphere that is the boundary.
    sphere_points = sphere(N, dim-1)
    # Randomize radii of the points on the sphere using power law to get a uniform distribution on the ball.
    radii = np.power(np.random.random(N), 1/dim)
    output = radii.reshape(-1, 1) * sphere_points
    return output

N = 5000
dim = 5
r_cutoff = 0.1
# Generate a sample to test
sample = ball(N, dim)
# Construct a KD Tree for the sample
sample_kdt = kdtree(sample)

# pdist method for distance matrix
tic = time.monotonic()
pdist(sample)
toc = time.monotonic()
print(f"Time taken from pdist = {toc-tic}")

# KD Tree method for distance matrix
tic = time.monotonic()
sample_kdt.sparse_distance_matrix(sample_kdt, r_cutoff)
toc = time.monotonic()
print(f"Time taken from the KDTree method = {toc-tic}")

Upvotes: 1

Views: 1017

Answers (1)

Willem Hendriks
Willem Hendriks

Reputation: 1487

import numpy as np
from sklearn.neighbors import BallTree

tic = time.monotonic()

tree = BallTree(sample, leaf_size=10)       
d,i = tree.query(sample, k=1)

toc = time.monotonic()

print(f"Time taken from Sklearn BallTree = {toc-tic}")

This one did Time taken from Sklearn BallTree = 0.30803330009803176 on my machine. The pdist did little over a second. Note: I am doing some heavy calculations which I have 3/4 cores on my machine.

That one takes the closest k=1


For radius 0.1

import numpy as np
from sklearn.neighbors import BallTree

tic = time.monotonic()

tree = BallTree(sample, leaf_size=10)       
i = tree.query_radius(sample, r=0.1)

toc = time.monotonic()

print(f"Time taken from Sklearn BallTree Radius = {toc-tic}")

with speed

Time taken from Sklearn BallTree Radius = 0.11115029989741743

Upvotes: 3

Related Questions