Reputation: 734
I am running an experiment that computes the approximate entropy of a signal. The details (and the actual code) can be found on its Wikipedia page. Unfortunately, while the algorithm itself works, it is very slow on large datasets (for example, it takes roughly 25 seconds on a signal of length 2000). Since I have to run this computation on many longer signals, at this speed I expect my experiment to take at least a month. I was wondering if there is any way to speed up the algorithm.
import numpy as np

def ApEn(U, m, r):
    def _maxdist(x_i, x_j):
        # Chebyshev distance between two windows
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        # Build all overlapping windows of length m, then for each window
        # compute the fraction of windows within distance r of it
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)
    return abs(_phi(m + 1) - _phi(m))
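For reference, this is how I call it on the repeating test signal from the Wikipedia article (the same signal appears in the commented-out line of the Cython answer's main.py below; per the article, the expected result for this input is a small value on the order of 1e-5):

U = [85, 80, 89] * 17
print(ApEn(U, 2, 3))  # roughly 1e-5 according to the Wikipedia example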
Upvotes: 2
Views: 197
Reputation: 2465
If you are willing to move that function to Cython and add some type annotations, there are significant performance gains to be made. Here is my version of the algorithm:
apen.pyx:
cimport cython
from libc.math cimport fabs, log
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
@cython.cdivision(True)
cdef double max_dist(double[:] x_i, double[:] x_j, int m) nogil:
    # Performs the max function described in step 4 of the ApEn algorithm
    cdef int k
    cdef double out
    cdef double dist
    out = fabs(x_i[0] - x_j[0])
    for k in range(1, m):  # range(1, m - 1) would skip the last component
        dist = fabs(x_i[k] - x_j[k])
        if dist > out:
            out = dist
    return out

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
@cython.cdivision(True)
cdef double phi(double[:] Sn, int m, double r):
    cdef int N = len(Sn)
    cdef int i
    cdef int j
    cdef int c_val
    cdef int counter
    cdef double phi_sum = 0
    cdef double phi
    cdef double m_dist
    # Performs step 3 of the ApEn algorithm
    cdef double[:, :] x = np.empty((N - m + 1, m), dtype=np.float64)
    with nogil:
        for i in range(N - m + 1):
            for j in range(0, m):
                x[i, j] = Sn[j + i]
        # Performs combined steps 4 & 5 of the ApEn algorithm
        for i in range(N - m + 1):
            counter = 0
            for j in range(N - m + 1):
                m_dist = max_dist(x[i], x[j], m)
                c_val = 1 if m_dist <= r else 0
                counter += c_val
            phi_sum += log(counter / (N - m + 1.0))
    phi = phi_sum / (N - m + 1.0)
    return phi

# Steps 1 & 2 of the ApEn algorithm are covered by the caller choosing m and r
cpdef double approx_entropy(double[:] Sn, int m, double r):
    # Performs step 6 of the ApEn algorithm
    cdef double ApEn = fabs(phi(Sn, m, r) - phi(Sn, m + 1, r))
    return ApEn
apen.pxd:
cdef double max_dist(double[:] x_i, double[:] x_j, int m) nogil
cdef double phi(double[:] Sn, int m, double r)
cpdef double approx_entropy(double[:] Sn, int m, double r)
setup.py:
from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension("apen", sources=["apen.pyx"], include_dirs=[np.get_include()], extra_compile_args=["-w"]),
]

setup(
    ext_modules=cythonize(extensions)
)
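After writing these files, build the extension in place with the standard Cython build step so that main.py can import apen:

python setup.py build_ext --inplace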
main.py:
import time
import apen
import numpy as np

data = np.random.rand(2000)
#data = np.array([85, 80, 89] * 17, dtype=np.float64)  # Wikipedia's example signal

start = time.time()
answer = apen.approx_entropy(Sn=data, m=2, r=3)
end = time.time()

print(answer)
print(end - start)  # measures only the ApEn computation
Using this code for 2000 random data points on my laptop, the Cython code calculates ApEn in 0.36 s. In contrast, the Wikipedia code takes 14.75 s. This amounts to about a 40x speed-up. Hope you find this helpful!
Upvotes: 1
Reputation: 6375
Normally when optimizing you should start with algorithmic optimizations that reduce the complexity of the algorithm, not just its constant factors.
One rule of thumb is to look at the innermost loop: it contains the operations that execute the most times.
I'm not sure I read the code correctly, but it looks like _maxdist is evaluated over and over on overlapping windows of U, so the same elementwise differences get computed many times. In that case it makes sense to ensure each of those calculations is performed only once: for example, compute the values up front, store them in an array, and reuse them in _phi (see the sketch below).
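As a hypothetical sketch of that idea in NumPy (the function and variable names here are mine, not from the question): the pairwise window distances for length m + 1 can be built incrementally from those for length m, so each elementwise difference |U[i+k] - U[j+k]| is computed only once instead of once per _phi call:

import numpy as np

def apen_incremental(U, m, r):
    # Sketch: cache |U[i] - U[j]| once and reuse it for both phi(m) and phi(m + 1)
    U = np.asarray(U, dtype=float)
    N = len(U)
    abs_diff = np.abs(U[:, None] - U[None, :])  # abs_diff[i, j] = |U[i] - U[j]|

    def _phi_from(D):
        # D[i, j] is the Chebyshev distance between the windows starting at i and j
        K = D.shape[0]
        C = (D <= r).sum(axis=1) / K
        return np.log(C).mean()

    # A running elementwise maximum builds the distances for windows of length m
    D = abs_diff
    for k in range(1, m):
        D = np.maximum(D[:-1, :-1], abs_diff[k:, k:])
    phi_m = _phi_from(D)

    # One more step extends every window by one sample, giving length m + 1
    D = np.maximum(D[:-1, :-1], abs_diff[m:, m:])
    return abs(_phi_from(D) - phi_m)

This trades O(N^2) memory for the cached distance matrix against the repeated work in the inner loop; for very long signals you would process the matrix in blocks instead.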
Upvotes: 1
Reputation: 433
I didn't look at the whole thing, but to give you an example of how you can optimize the function using vectorized calculations:

import numpy as np

def maxdist_opti(x_i, x_j):
    return np.max(np.abs(x_i - x_j))

When your data is stored in numpy arrays, you can use numpy operators on them (and there are a lot of them; you can take a look here: https://docs.scipy.org/doc/numpy-1.13.0/user/index.html) and it will be much quicker. In the case above I used subtraction and the np.max function on numpy arrays.
Here, using random data:
x_i = np.random.rand(10000)
x_j = np.random.rand(10000)
The data used here is not that long, yet you can already see a very nice performance gain:
%timeit _maxdist(x_i,x_j)
100 loops, best of 3: 3.01 ms per loop
%timeit maxdist_opti(x_i,x_j)
10000 loops, best of 3: 28 µs per loop
You can apply the same logic to the whole formula, using only vectorized calculations, and the performance gain will be tremendous (see the sketch below).
Note that the longer your data is, the more you gain from vectorization.
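As a hypothetical sketch of where that logic leads (the function name is mine, and sliding_window_view requires NumPy >= 1.20):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def apen_vectorized(U, m, r):
    # Fully vectorized ApEn following the Wikipedia definition
    U = np.asarray(U, dtype=float)
    N = len(U)

    def _phi(m):
        x = sliding_window_view(U, m)  # all windows of length m: shape (N - m + 1, m)
        # Pairwise Chebyshev distances via broadcasting: shape (K, K)
        d = np.abs(x[:, None, :] - x[None, :, :]).max(axis=2)
        C = (d <= r).sum(axis=1) / (N - m + 1.0)
        return np.log(C).mean()

    return abs(_phi(m + 1) - _phi(m))

The broadcast intermediate has shape (K, K, m), so for very long signals you would compute the distance matrix row by row rather than all at once.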
Upvotes: 1