Veridian Dynamics

Reputation: 241

Directed distances between a set of N points in 2D

I have N points in the plane stored in an (N, 2) shaped numpy array, and I need to calculate the forces between them: F_ij = V(|r_i - r_j|)(r_i - r_j), where V is a function that depends only on the distance between points i and j. In code, I need to run the following many times, so it should be as efficient as possible:

import numpy as np

def V(dist):  # Here, dist is a float
    return np.exp(-dist)/dist


N = 10000
r = np.random.rand(N, 2)
F = np.zeros((N, N, 2))

for i in range(N):
    for j in range(N):
        r_rel = r[i, :] - r[j, :]  # relative position, shape (2,)
        dist_rel = np.sqrt(r_rel[0]**2 + r_rel[1]**2)
        F[i, j, :] = V(dist_rel)*r_rel

This can probably be done more efficiently by using some memory trick (this 3-tensor is memory expensive), by doing only half of the double-loop operations (because F[i,j,:] = -F[j,i,:]), etc. So I am asking how to do this.

The function cdist from the scipy module does something very similar and runs about 12x faster than the code above. So, on the one hand, maybe there is a scipy/numpy function that already does what I want, and on the other hand, the code above can probably be written more efficiently.
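
For reference, the cdist baseline I am comparing against only computes the plain pairwise distances, not the forces:

from scipy.spatial.distance import cdist

D = cdist(r, r)  # (N, N) matrix of pairwise Euclidean distances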

Thanks for your help

Upvotes: 3

Views: 477

Answers (3)

Daniel F

Reputation: 14399

There is no reason to apply the time-expensive np.exp and np.sqrt functions to distances that will cause np.exp(-dist) to underflow to zero. The smallest double greater than zero is 5e-324, and -np.log(5e-324) = 744.44007192138122. Now in your example the maximum distance is only ~1.4, but I assume that in practice the distances will be larger.
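
You can see where the cutoff sits with a quick check on IEEE-754 doubles:

import numpy as np

np.exp(-700.0)  # ~9.9e-305, still representable
np.exp(-750.0)  # 0.0 -- underflows, so the force contribution vanishes

With that in mind we can do: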

import numpy as np

def V(dist_sq, thr=5e-324):  # Here, dist_sq is dist**2, an array of squared distances
    d_sq_thr = np.log(thr)**2
    mask = dist_sq < d_sq_thr  # mask is True where you want to do calculations
    out = np.sqrt(dist_sq[mask])  # take the square root only for distances that pass the mask
    out = np.exp(-out) / out  # same with `np.exp`
    return out, mask

N = r.shape[0]  # r is the (N, 2) array from the question
F = np.zeros((N, N, 2))
i, j = np.triu_indices(N, 1)  # gives you the valid (i, j) pairs
r_rel = r[i] - r[j]
sq_distances = np.sum(r_rel**2, axis=1)  # squared distances
v_, mask = V(sq_distances)
F[i[mask], j[mask], :] = v_[:, None] * r_rel[mask]  # broadcast V over both coordinates
F[j[mask], i[mask], :] = -F[i[mask], j[mask], :]

Now, you can set your thr value to something larger than machine precision and speed things up even more, even before you start wrapping it in numba as @max9111 suggests. I am not a numba expert, but I think this would work:

import numba as nb
import numpy as np

@nb.njit(parallel=True, error_model='numpy', fastmath=True)
def distance(r, thr=None):
    if thr is None:
        d_sq_thr = np.log(5e-324)**2
    else:
        d_sq_thr = np.log(thr)**2
    N = r.shape[0]
    F = np.zeros((N, N, 2))
    # give the compiler all information available
    assert r.shape[1] == 2
    for i in nb.prange(N):
        # change range(i) to range(N) to get all results
        for j in range(i):
            r_rel_1 = r[i, 0] - r[j, 0]
            r_rel_2 = r[i, 1] - r[j, 1]
            dist_rel_sq = r_rel_1 ** 2 + r_rel_2 ** 2
            if dist_rel_sq > d_sq_thr:  # skip pairs whose force underflows anyway
                continue
            dist_rel = np.sqrt(dist_rel_sq)
            V = np.exp(-dist_rel) / (dist_rel + 1)
            F[i, j, 0] = V * r_rel_1
            F[i, j, 1] = V * r_rel_2
    return F
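
For example, with a looser threshold (hypothetical values, assuming the points are spread out enough that the cutoff actually triggers):

import numpy as np

r = np.random.rand(10000, 2) * 100  # points in [0, 100]^2
F = distance(r, thr=1e-16)  # skip pairs with exp(-dist) < 1e-16, i.e. dist > ~36.8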

Keep in mind this method will be slower than the others on your toy problem, as the distance check is just overhead (your r values are all bounded on [0, 1], so nothing ever underflows), but when larger distances are involved this should speed things up, maybe even a lot.

Upvotes: 1

max9111

Reputation: 6482

This should be a comment on @Deepak Saini's answer. Apart from that already good answer, there are some performance-critical problems.

1. Declaration of arrays

nb.float64[:,:,:](nb.float64[:,:])

This means that both the input and the output array can be non-contiguous in memory. This often prevents SIMD-vectorization. If your arrays are definitely contiguous you can declare

nb.float64[:,:,::1](nb.float64[:,::1])

which means the arrays are C-contiguous.

Or you can simply drop the declaration; Numba will detect the datatype and the contiguousness itself correctly.
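
If you are not sure the input is contiguous, you can enforce it before calling the jitted function (a small sketch, assuming r is your input array):

import numpy as np

r = np.asarray(r, dtype=np.float64)
if not r.flags['C_CONTIGUOUS']:
    r = np.ascontiguousarray(r)  # copies into C-contiguous memory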

2. Using many small loops which can be unrolled

Every additional loop comes with some overhead. In this example it is known that the second dimension of the array r is 2, so you can unroll the "hidden loops" manually. This improves the performance by more than an order of magnitude in this example.

# e.g. this
r_rel = r[i] - r[j]
# can be rewritten to
r_rel_1 = r[i, 0] - r[j, 0]
r_rel_2 = r[i, 1] - r[j, 1]

3. Calculating the results only once and copying sounds like a good idea

F[j,i] = -1 * F[i,j]

This line prevents SIMD-vectorization and has a very high cost, because the reads and writes are non-contiguous. This simple copy takes almost as long as calculating all the values directly. But since you know there is this kind of antisymmetry, you can avoid the copying process completely and apply the sign when you use F, as in the sketch below.
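
For example, a hypothetical accessor that exploits the antisymmetry instead of storing both halves:

def force_ij(F, i, j):
    # F holds only the lower triangle (i > j); use F_ij = -F_ji for the rest
    if i > j:
        return F[i, j]
    return -F[j, i]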

4. Using np.zeros if you overwrite the results anyway

Zero-initialization has about a third of the cost of calculating the upper half of the array. In the backend there is always an np.empty() to allocate the memory; if you use np.zeros, a nested loop additionally writes zeros to the whole array, which is very costly. If you overwrite (or never read) those entries anyway, use np.empty() directly.
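
To see what this costs in isolation, you could compare the two allocation strategies inside numba (a minimal sketch; absolute timings depend on your machine):

import numba as nb
import numpy as np

@nb.njit
def alloc_zeros(N):
    return np.zeros((N, N, 2))  # allocation plus explicit zero-fill in the compiled code

@nb.njit
def alloc_empty(N):
    return np.empty((N, N, 2))  # allocation only; contents are uninitialized

Timing alloc_zeros(10000) against alloc_empty(10000) (e.g. with IPython's %timeit) shows the pure zero-filling overhead.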

5. Parallelization

On small problems this can have a negative impact on performance, but larger problems benefit from it. Caching isn't available for parallelized functions, and the compilation time is higher.

Single threaded version

import numba as nb
import numpy as np

@nb.njit(fastmath=True, error_model='numpy', cache=True)
def distance(r):
    N = r.shape[0]
    F = np.empty((N, N, 2))  # np.empty: the upper half is never written, see point 4
    # give the compiler all information available
    assert r.shape[1] == 2
    for i in range(N):
        # change range(i) to range(N) to get all results
        for j in range(i):
            r_rel_1 = r[i, 0] - r[j, 0]
            r_rel_2 = r[i, 1] - r[j, 1]
            dist_rel = np.sqrt(r_rel_1 ** 2 + r_rel_2 ** 2)
            V = np.exp(-dist_rel) / (dist_rel + 1)
            F[i, j, 0] = V * r_rel_1
            F[i, j, 1] = V * r_rel_2
    return F

Parallel version

@nb.njit(parallel=True, error_model='numpy', fastmath=True)
def distance(r):
    N = r.shape[0]
    F = np.empty((N, N, 2))
    # give the compiler all information available
    assert r.shape[1] == 2
    for i in nb.prange(N):
        # change range(i) to range(N) to get all results
        for j in range(i):
            r_rel_1 = r[i, 0] - r[j, 0]
            r_rel_2 = r[i, 1] - r[j, 1]
            dist_rel = np.sqrt(r_rel_1 ** 2 + r_rel_2 ** 2)
            V = np.exp(-dist_rel) / (dist_rel + 1)
            F[i, j, 0] = V * r_rel_1
            F[i, j, 1] = V * r_rel_2
    return F

Timings

Core i7 quad-core (4th gen), Python 3.5, Numba 0.4dev. The second call to each function is measured, to avoid including compilation overhead. Only one half of the array is calculated; for a direct comparison I commented out the line F[j,i] = -1 * F[i,j] in @Deepak Saini's version.

N = 1000
@Deepak Saini: 109 ms
single threaded: 6.4 ms
parallel: 5 ms

N = 10000
@Deepak Saini: 10.43 s
single threaded: 0.61 s
parallel: 0.36 s

Upvotes: 2

Deepak Saini

Reputation: 2910

Two approaches come to my mind.

First is to use scipy's cdist. But it has a couple of issues. For one, it expects the output of the custom distance function to be a scalar, so we need to calculate the x and y coordinates of the force separately. Second, it does the calculations for all the pairs (i, j) and (j, i), so, as mentioned in the question, the work could be cut to n^2/2. For N = 1000 it takes 14.141 s; for N = 10000 it takes forever. Calculations made on an 8 GB, 8-core Mac.

import time
import numpy as np 
from scipy.spatial import distance

N = 1000
r = np.random.rand(N, 2)

def force(a, b):
    def V(dist):  #Here, dist is a float
        return np.exp(-dist)/(dist+1)

    r_rel = a - b  #Relative position
    dist_rel = np.sqrt(r_rel[0]**2 + r_rel[1]**2)
    return V(dist_rel)*r_rel

def force_x(a, b):
    return force(a,b)[0]
def force_y(a, b):
    return force(a, b)[1]

t1 = time.time()
# Calculate x and y coordinates separately
F_x = distance.cdist(r, r, metric=force_x)
F_y = distance.cdist(r, r, metric=force_y)
print("Time taken is = ", time.time() - t1) # takes 14.141s

The second method, which does implement the n^2/2 optimization, is to keep the explicit loop but speed up the calculations using numba. If you have a GPU, you can speed things up further with numba vectorize. For N = 1000 it takes 0.240 s; for N = 10000, ~25 s.

import numba as nb
@nb.jit(nb.float64[:,:,:](nb.float64[:,:]), nopython=True, cache=True)
def distance(r):
    N = r.shape[0]
    F = np.zeros((N,N,2))
    for i in range(N):
        for j in range(i):
            r_rel = r[i] - r[j]  #Relative position
            dist_rel = np.sqrt(r_rel[0] ** 2 + r_rel[1] ** 2)
            V = np.exp(-dist_rel) / (dist_rel+1)    
            F[i,j] = V * r_rel
            F[j,i] = -1 * F[i,j]
    return F

t1 = time.time()
F = distance(r)
print("Time taken is = ", time.time() - t1) # takes 0.240s

So, short of implementing the O(n log n) method described in the link in the comments, the numba method handles the problem size from the question in acceptable time.
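
If the force is negligible beyond some cutoff, a minimal sketch of that neighbor-based idea using scipy's cKDTree (the cutoff value here is an arbitrary placeholder):

import numpy as np
from scipy.spatial import cKDTree

cutoff = 5.0  # assumed distance beyond which exp(-dist) is negligible
tree = cKDTree(r)  # build once, O(n log n)
pairs = np.array(list(tree.query_pairs(cutoff)))  # only the (i, j) pairs with dist < cutoff
# evaluate the force only for these pairs instead of all n^2 combinations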

Upvotes: 2
