Reputation: 241
I have N points in the plane stored as an (N, 2) shaped numpy array and I need to calculate the forces between them: F_ij = V(|r_i - r_j|)(r_i - r_j), where V is a function that depends only on the distance between points i and j. In code terms, I need to run the following many times, so it should be as efficient as possible:
import numpy as np

def V(dist):  # Here, dist is a float
    return np.exp(-dist)/dist

N = 10000
r = np.random.rand(N, 2)
F = np.zeros((N, N, 2))
for i in range(N):
    for j in range(N):
        r_rel = r[i, :] - r[j, :]  # Relative position
        dist_rel = np.sqrt(r_rel[0]**2 + r_rel[1]**2)
        F[i, j, :] = V(dist_rel)*r_rel
This can probably be done more efficiently with some memory trick (the 3-tensor is memory expensive), by doing only half of the double loop (because F[i, j, :] = -F[j, i, :]), etc. So I am asking how to do this.
The cdist function from scipy does something very similar and runs about 12x faster than the code above, so on the one hand there may already be a scipy/numpy function that does what I want, and on the other hand the code above can surely be written more efficiently.
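For reference, the most obvious fully vectorised version I could come up with (it removes the Python loops but still materialises the full (N, N, 2) array, so it does not solve the memory problem):

import numpy as np
from scipy.spatial.distance import cdist

N = 2000                                  # smaller N here; at N = 10000 these arrays need a few GB
r = np.random.rand(N, 2)

r_rel = r[:, None, :] - r[None, :, :]     # (N, N, 2) relative positions via broadcasting
dist = cdist(r, r)                        # (N, N) pairwise distances
np.fill_diagonal(dist, 1.0)               # avoid 0/0 on the diagonal (r_rel is 0 there anyway)
F = (np.exp(-dist) / dist)[:, :, None] * r_rel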
Thanks for your help
Upvotes: 3
Views: 477
Reputation: 14399
There is no reason to apply the time-expensive np.exp and np.sqrt functions to distances at which np.exp(-dist) will underflow to zero. The smallest double greater than zero is 5e-324, and -np.log(5e-324) = 744.44007192138122. Now, in your example the maximum distance is only ~1.4, but I assume that in practice the distances will be larger. With that in mind we can do:
import numpy as np

def V(dist_sq, thr=5e-324):  # Here, dist_sq is dist**2, an array of squared distances
    d_sq_thr = np.log(thr)**2
    mask = dist_sq < d_sq_thr        # mask is True where you want to do calculations
    out = np.sqrt(dist_sq[mask])     # square root only for valid distances
    out = np.exp(-out) / out         # same with np.exp
    return out, mask
# N and r as defined in the question
F = np.zeros((N, N, 2))
i, j = np.triu_indices(N, 1)             # gives you the valid (i, j) pairs
r_rel = r[i] - r[j]
sq_distances = np.sum(r_rel**2, axis=1)  # squared distances
v_, mask = V(sq_distances)
F[i[mask], j[mask], :] = v_[:, None] * r_rel[mask]
F[j[mask], i[mask], :] = -F[i[mask], j[mask], :]
Now you can set your thr value to something larger than machine precision and speed things up even more, even before you start wrapping it in numba as @max9111 suggests. Not a numba expert, but I think this would work:
import numpy as np
import numba as nb

@nb.njit(parallel=True, error_model='numpy', fastmath=True)
def distance(r, thr=None):
    if thr is None:
        d_sq_thr = np.log(5e-324)**2
    else:
        d_sq_thr = np.log(thr)**2
    N = r.shape[0]
    F = np.zeros((N, N, 2))
    # give the compiler all information available
    assert r.shape[1] == 2
    for i in nb.prange(N):
        # change range(i) to range(N) to get all results
        for j in range(i):
            r_rel_1 = r[i, 0] - r[j, 0]
            r_rel_2 = r[i, 1] - r[j, 1]
            dist_rel_sq = r_rel_1**2 + r_rel_2**2
            if dist_rel_sq > d_sq_thr:
                continue
            dist_rel = np.sqrt(dist_rel_sq)
            V = np.exp(-dist_rel) / (dist_rel + 1)
            F[i, j, 0] = V * r_rel_1
            F[i, j, 1] = V * r_rel_2
    return F
Keep in mind that this method will be slower than the others on your toy problem, as the distance check is just overhead (your r values are all bounded on [0, 1]), but when larger distances occur it should speed things up, maybe even by a lot.
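For what it's worth, a rough sketch of how the threshold could be used in practice (the numbers here are purely illustrative):

import numpy as np

# points spread over a 100 x 100 box so that many pairs are farther apart than the cutoff;
# thr is the smallest value of exp(-dist) still treated as non-zero, so thr = exp(-40)
# means "skip pairs with dist > 40" (exp(-40) ~ 4e-18, far below double precision epsilon)
r = np.random.rand(5000, 2) * 100.0
F = distance(r, thr=np.exp(-40.0))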
Upvotes: 1
Reputation: 6482
This should be a comment to @Deepak Saini's answer. Apart from that already good answer, there are some performance-critical problems.
1. Declaration of arrays
nb.float64[:,:,:](nb.float64[:,:])
This means that both the input and the output array may be non-contiguous in memory, which often prevents SIMD-vectorization. If your arrays are definitely contiguous you can declare
nb.float64[:,:,::1](nb.float64[:,::1])
which means the arrays are C-contiguous.
Or you can simply drop the declaration; Numba will then detect the datatype and the contiguity correctly by itself. A small sketch of the explicit variant follows.
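For illustration, a minimal sketch of the explicitly C-contiguous declaration applied to the single-threaded kernel shown further down (only the decorator changes; with the signature given, compilation happens eagerly at definition time instead of on the first call):

import numpy as np
import numba as nb

@nb.njit(nb.float64[:, :, ::1](nb.float64[:, ::1]), fastmath=True, error_model='numpy', cache=True)
def distance_contig(r):
    N = r.shape[0]
    F = np.empty((N, N, 2))
    assert r.shape[1] == 2
    for i in range(N):
        for j in range(i):
            r_rel_1 = r[i, 0] - r[j, 0]
            r_rel_2 = r[i, 1] - r[j, 1]
            dist_rel = np.sqrt(r_rel_1**2 + r_rel_2**2)
            V = np.exp(-dist_rel) / (dist_rel + 1)
            F[i, j, 0] = V * r_rel_1
            F[i, j, 1] = V * r_rel_2
    return F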
2. Using many small loops which can be unrolled
Every additional loop comes with some overhead. In this example it is known that the second dimension of the array r is 2, so you can unroll the "hidden loops" manually. In this example that improves the performance by more than an order of magnitude.
# e.g. this
r_rel = r[i] - r[j]
# can be rewritten to
r_rel_1 = r[i, 0] - r[j, 0]
r_rel_2 = r[i, 1] - r[j, 1]
3. Calculating the results only once and copying sounds like a good idea
F[j,i] = -1 * F[i,j]
This line prevents SIMD-vectorization and is very costly (non-contiguous reads/writes over a large array). The copy takes about as long as, or even a bit longer than, calculating all the values directly. So if you know there is this kind of symmetry, avoid the copying process completely if you can and exploit the symmetry in the consuming code instead, for example:
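A minimal sketch, assuming the forces are eventually summed per particle anyway: use each pair value twice while it is still in registers instead of writing the mirrored lower half of the big array.

import numpy as np
import numba as nb

@nb.njit(fastmath=True, error_model='numpy', cache=True)
def net_forces(r):
    N = r.shape[0]
    F_net = np.zeros((N, 2))
    for i in range(N):
        for j in range(i):
            r_rel_1 = r[i, 0] - r[j, 0]
            r_rel_2 = r[i, 1] - r[j, 1]
            dist_rel = np.sqrt(r_rel_1**2 + r_rel_2**2)
            V = np.exp(-dist_rel) / (dist_rel + 1)
            F_net[i, 0] += V * r_rel_1
            F_net[i, 1] += V * r_rel_2
            F_net[j, 0] -= V * r_rel_1   # antisymmetry used directly, no write to F[j, i]
            F_net[j, 1] -= V * r_rel_2
    return F_net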
4. Avoid np.zeros if you overwrite the results anyway
Zeroing the array costs about 1/3 of calculating the upper half of it. In the backend there is always an np.empty() to allocate the memory; if you use np.zeros, there is an additional nested loop which writes zeros to the whole array, which is quite costly.
5. Parallelization
On small problems this can have a negative impact on performance, but larger problems benefit from it. Caching isn't available for parallelized functions and the compilation time is higher.
Single threaded version
import numpy as np
import numba as nb

@nb.jit(fastmath=True, error_model='numpy', cache=True)
def distance(r):
    N = r.shape[0]
    F = np.empty((N, N, 2))
    # give the compiler all information available
    assert r.shape[1] == 2
    for i in range(N):
        # change range(i) to range(N) to get all results
        for j in range(i):
            r_rel_1 = r[i, 0] - r[j, 0]
            r_rel_2 = r[i, 1] - r[j, 1]
            dist_rel = np.sqrt(r_rel_1**2 + r_rel_2**2)
            V = np.exp(-dist_rel) / (dist_rel + 1)
            F[i, j, 0] = V * r_rel_1
            F[i, j, 1] = V * r_rel_2
    return F
Parallel version
@nb.njit(parallel=True, error_model='numpy', fastmath=True)
def distance(r):
    N = r.shape[0]
    F = np.empty((N, N, 2))
    # give the compiler all information available
    assert r.shape[1] == 2
    for i in nb.prange(N):
        # change range(i) to range(N) to get all results
        for j in range(i):
            r_rel_1 = r[i, 0] - r[j, 0]
            r_rel_2 = r[i, 1] - r[j, 1]
            dist_rel = np.sqrt(r_rel_1**2 + r_rel_2**2)
            V = np.exp(-dist_rel) / (dist_rel + 1)
            F[i, j, 0] = V * r_rel_1
            F[i, j, 1] = V * r_rel_2
    return F
Timings
Core i7 quad-core (4th gen), Python 3.5, Numba 0.4dev
The second call to the function is always measured, to avoid including compilation overhead. Only one half of the array is calculated; for a direct comparison I commented out the line F[j,i] = -1 * F[i,j] in @Deepak Saini's version.
N = 1000
@Deepak Saini:   109 ms
single_threaded: 6.4 ms
parallel:        5 ms
N = 10000
@Deepak Saini:   10.43 s
single_threaded: 0.61 s
parallel:        0.36 s
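For reference, a minimal sketch of the measurement procedure (not the exact benchmark script): the first call triggers compilation, only the second call is timed.

import time
import numpy as np

N = 10000
r = np.random.rand(N, 2)

distance(r)                      # warm-up call, includes compilation
t0 = time.time()
F = distance(r)                  # this is the call that gets measured
print("second call:", time.time() - t0, "s")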
Upvotes: 2
Reputation: 2910
Two approaches come to mind.
The first is to use scipy's cdist. But it has a couple of issues. First, it expects the output of the custom distance function to be a scalar, so we need to calculate the x and y components of the force separately. Second, it does the calculation for both (i, j) and (j, i), so as mentioned in the question the work could be cut to n^2/2. For N = 1000 it takes 14.141 s; for N = 10000 it takes forever. (Calculations made on an 8 GB, 8-core Mac.)
import time
import numpy as np
from scipy.spatial import distance

N = 1000
r = np.random.rand(N, 2)

def force(a, b):
    def V(dist):  # Here, dist is a float
        return np.exp(-dist)/(dist+1)  # dist+1 keeps the i == j diagonal finite
    r_rel = a - b  # Relative position
    dist_rel = np.sqrt(r_rel[0]**2 + r_rel[1]**2)
    return V(dist_rel)*r_rel

def force_x(a, b):
    return force(a, b)[0]

def force_y(a, b):
    return force(a, b)[1]

t1 = time.time()
# Calculate x and y components separately
F_x = distance.cdist(r, r, metric=force_x)
F_y = distance.cdist(r, r, metric=force_y)
print("Time taken is = ", time.time() - t1)  # takes 14.141s
The second method, which takes care of the n^2/2 optimization, is to keep the explicit double loop but speed up the calculations using numba. If you have a GPU, you can speed things up further with numba vectorize. For N = 1000 it takes 0.240 s; for N = 10000 it takes ~25 s.
import numba as nb

@nb.jit(nb.float64[:,:,:](nb.float64[:,:]), nopython=True, cache=True)
def distance(r):
    N = r.shape[0]
    F = np.zeros((N, N, 2))
    for i in range(N):
        for j in range(i):
            r_rel = r[i] - r[j]  # Relative position
            dist_rel = np.sqrt(r_rel[0]**2 + r_rel[1]**2)
            V = np.exp(-dist_rel) / (dist_rel+1)
            F[i, j] = V * r_rel
            F[j, i] = -1 * F[i, j]
    return F
t1 = time.time()
F = distance(r)
print("Time taken is = ", time.time() - t1) # takes 0.240s
So, short of implementing the O(n log n) method described in the link in the comments, the numba method handles the example size from the question in acceptable time. A rough sketch of the idea behind such a neighbour-cutoff approach follows.
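This is only my own guess at the idea, assuming a cutoff radius r_cut beyond which V is negligible, not the exact method from the linked answer:

import numpy as np
from scipy.spatial import cKDTree

N = 10000
r_cut = 10.0                              # assumed cutoff radius (illustrative value)
r = np.random.rand(N, 2) * 100.0          # spread the points out so the cutoff actually prunes pairs

tree = cKDTree(r)
pairs = np.array(sorted(tree.query_pairs(r_cut)))   # (M, 2) array of index pairs with i < j
i, j = pairs[:, 0], pairs[:, 1]
r_rel = r[i] - r[j]
dist = np.linalg.norm(r_rel, axis=1)
v = np.exp(-dist) / (dist + 1)
F_pairs = v[:, None] * r_rel              # force of j on i; the opposite force is just -F_pairs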
Upvotes: 2