PatriceG

Reputation: 4063

Optimize computation of the "difference function"

My code makes numerous calls to a "difference function" while computing the YIN algorithm (a fundamental frequency extractor).

The difference function (eq. 6 in the paper) is defined as:

$d_t(\tau) = \sum_{j=1}^{W} \left(x_j - x_{j+\tau}\right)^2$

And this is my implementation of the difference function:

def differenceFunction(x, W, tau_max):
    df = [0] * tau_max
    for tau in range(1, tau_max):
        for j in range(0, W - tau):
            tmp = int(x[j] - x[j + tau])  # cast to Python int so the square cannot overflow int16
            df[tau] += tmp * tmp
    return df

For instance with:

import numpy as np

x = np.random.randint(0, high=32000, size=2048, dtype='int16')
W = 2048
tau_max = 106
differenceFunction(x, W, tau_max)

Is there a way to optimize this double-loop computation (in Python only, preferably with no libraries other than NumPy)?

EDIT: Changed code to avoid an IndexError (j loop, @Elliot's answer)

EDIT 2: Changed code to use x[0] (j loop, @hynekcer's comment)

Upvotes: 7

Views: 432

Answers (7)

kureta

Reputation: 483

I was trying to make sense of the fastest answer and I just came up with a faster and simpler solution.

import numpy as np

def autocorrelation(x):
    # Autocorrelation for non-negative lags via np.correlate.
    result = np.correlate(x, x, mode='full')
    return result[result.size // 2:]

def difference(x):
    # Energy terms minus twice the autocorrelation, following the YIN difference function.
    return np.dot(x, x) + (x * x)[::-1].cumsum()[::-1] - 2 * autocorrelation(x)

The solution is based on the difference function as defined in the YIN paper.
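A rough usage sketch (my addition; frame below is just an assumed example input shaped like the question's data, cast to float64 so the energy terms don't overflow int16):

frame = np.random.randint(0, high=32000, size=2048, dtype='int16').astype(np.float64)
df = difference(frame)
print(df[:5])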

%%timeit
difference(frame)

140 µs ± 438 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Upvotes: 0

hynekcer

Reputation: 15548

EDIT: Improved speed to 220 µs - see the edit at the end (direct version)

The required calculation can be easily evaluated with the autocorrelation function or, equivalently, by convolution. The Wiener–Khinchin theorem allows computing the autocorrelation with two fast Fourier transforms (FFT), with time complexity O(n log n). I use the accelerated convolution function fftconvolve from the SciPy package. An advantage is that it is easy to explain here why it works. Everything is vectorized; there are no loops at the Python interpreter level.
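As a quick illustration of that theorem (my sketch, not part of the solution below), the FFT route gives the same autocorrelation as a direct computation once the signal is zero-padded to twice its length:

import numpy as np

x = np.random.randint(0, high=32000, size=2048).astype(np.float64)
n = x.size

# Autocorrelation via two FFTs (Wiener-Khinchin); zero-padding to 2*n keeps
# the circular correlation equal to the linear one for all positive lags.
fc = np.fft.rfft(x, 2 * n)
acf_fft = np.fft.irfft(fc * fc.conjugate())[:n]

# Direct autocorrelation, positive lags only, for comparison.
acf_direct = np.correlate(x, x, mode='full')[n - 1:]

print(np.allclose(acf_fft, acf_direct))  # expected: True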

import numpy as np
from scipy.signal import fftconvolve

def difference_by_convol(x, W, tau_max):
    x = np.array(x, np.float64)
    w = x.size
    x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum()))
    conv = fftconvolve(x, x[::-1])
    df = x_cumsum[w:0:-1] + x_cumsum[w] - x_cumsum[:w] - 2 * conv[w - 1:]
    return df[:tau_max + 1]

  • Compared with the differenceFunction_1loop function in Elliot's answer: it is faster with FFT: 430 µs compared to the original 1170 µs. It starts to be faster for about tau_max >= 40. The numerical accuracy is great; the highest relative error is less than 1E-14 compared to the exact integer result. (It could therefore easily be rounded to the exact long-integer solution.)
  • The parameter tau_max is not important for the algorithm; it only restricts the output at the end. A zero element at index 0 is added to the output because indices start at 0 in Python.
  • The parameter W is not important in Python; it is better to introspect the size of the data.
  • The data are converted to np.float64 initially to prevent repeated conversions. It is about half a percent faster. Any type smaller than np.int64 would be unacceptable because of overflow.
  • The required difference function is double the energy minus the autocorrelation function. That can be evaluated by convolution: correlate(x, x) == convolve(x, reversed(x)).
  • "As of Scipy v0.19, normal convolve automatically chooses this method or the direct method based on an estimation of which is faster." That heuristic is not adequate in this case because the convolution evaluates many more tau values than tau_max, and it must be outweighed by an FFT that is much faster than the direct method.
  • It can also be calculated with the NumPy fft module alone, without SciPy, by rewriting the answer "Calculate autocorrelation using FFT in Matlab" to Python (below, at the end). I think the solution above is easier to understand.

Proof: (for Pythonistas :-)

The original naive implementation can be written as:

df = [sum((x[j] - x[j + t]) ** 2   for j in range(w - t))  for t in range(tau_max + 1)]

where tau_max < w.

Expand using the identity (a - b)**2 == a**2 + b**2 - 2 * a * b:

df = [  sum(x[j] ** 2 for j in range(w - t))
      + sum(x[j] ** 2 for j in range(t, w))
      - 2 * sum(x[j] * x[j + t] for j in range(w - t))
      for t in range(tau_max + 1)]

Substitute the first two sums with the help of x_cumsum = [sum(x[j] ** 2 for j in range(i)) for i in range(w + 1)], which can easily be calculated in linear time. Substitute sum(x[j] * x[j + t] for j in range(w - t)) by the convolution conv = fftconvolve(x, reversed(x), mode='full'), whose output has size len(x) + len(x) - 1.

df = [x_cumsum[w - t] + x_cumsum[w] - x_cumsum[t]
      - 2 * convolve(x, x[::-1])[w - 1 + t]
      for t in range(tau_max + 1)]

Optimize by vector expressions:

df = x_cumsum[w:0:-1] + x_cumsum[w] - x_cumsum[:w] - 2 * conv[w - 1:]

Every step can also be tested and compared numerically on test data.
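For example, a quick numerical check (my sketch) comparing the naive definition with the final vectorized expression on a small random signal:

import numpy as np
from scipy.signal import fftconvolve

x = np.random.randint(0, 32000, size=256).astype(np.float64)
w = x.size
tau_max = 40

# Naive definition, straight from the formula above.
naive = [sum((x[j] - x[j + t]) ** 2 for j in range(w - t))
         for t in range(tau_max + 1)]

# Vectorized expression from the derivation.
x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum()))
conv = fftconvolve(x, x[::-1])
vectorized = (x_cumsum[w:0:-1] + x_cumsum[w] - x_cumsum[:w] - 2 * conv[w - 1:])[:tau_max + 1]

print(np.allclose(naive, vectorized))  # expected: True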


EDIT: Implemented the solution directly with NumPy FFT.

import numpy as np

def difference_fft(x, W, tau_max):
    x = np.array(x, np.float64)
    w = x.size
    tau_max = min(tau_max, w)
    x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum()))
    size = w + tau_max
    p2 = (size // 32).bit_length()
    # Pad the FFT to the nearest "nice" length (small prime factors) >= w + tau_max.
    nice_numbers = (16, 18, 20, 24, 25, 27, 30, 32)
    size_pad = min(n * 2 ** p2 for n in nice_numbers if n * 2 ** p2 >= size)
    fc = np.fft.rfft(x, size_pad)
    conv = np.fft.irfft(fc * fc.conjugate())[:tau_max]
    return x_cumsum[w:w - tau_max:-1] + x_cumsum[w] - x_cumsum[:tau_max] - 2 * conv

It is more than twice as fast as my previous solution because the convolution length is restricted to the nearest "nice" number with small prime factors above W + tau_max, rather than the full 2 * W. It is also not necessary to transform the same data twice, as it was with `fftconvolve(x, reversed(x))`.
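To make the padding choice concrete, a small worked example (my illustration) with the question's sizes W = 2048 and tau_max = 106:

size = 2048 + 106                      # w + tau_max = 2154
p2 = (size // 32).bit_length()         # 2154 // 32 = 67, bit_length 7, i.e. a factor of 2**7 = 128
nice_numbers = (16, 18, 20, 24, 25, 27, 30, 32)
size_pad = min(n * 2 ** p2 for n in nice_numbers if n * 2 ** p2 >= size)
print(size_pad)                        # 2304 = 18 * 128 = 2**8 * 3**2, instead of the full 2 * 2048 = 4096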

In [211]: %timeit differenceFunction_1loop(x, W, tau_max)
1.1 ms ± 4.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [212]: %timeit difference_by_convol(x, W, tau_max)
431 µs ± 5.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [213]: %timeit difference_fft(x, W, tau_max)
218 µs ± 685 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The newest solution is faster than Elliot's differenceFunction_1loop already for tau_max >= 20. That ratio doesn't depend much on the data size because of the similar ratio of overhead costs.

Upvotes: 6

Antimony

Reputation: 2240

Here's another approach using a list comprehension. It takes roughly a tenth of the time taken by the original function, but does not beat Elliot's answer. Just putting it out there anyway.

import numpy as np
import time

# original version
def differenceFunction_2loop(x, W, tau_max):
   df = np.zeros(tau_max, np.long)
   for tau in range(1, tau_max):
      for j in range(0, W - tau): # -tau eliminates the IndexError
          tmp = np.long(x[j] -x[j + tau])
          df[tau] += np.square(tmp)
   return df

# vectorized inner loop
def differenceFunction_1loop(x, W, tau_max):
    df = np.zeros(tau_max, np.long)
    for tau in range(1, tau_max):
        tmp = (x[:-tau]) - (x[tau:]).astype(np.long)
        df[tau] = np.dot(tmp, tmp)
    return df

# with list comprehension
def differenceFunction_1loop_listcomp(x, W, tau_max):
    df = [sum(((x[:-tau]) - (x[tau:]).astype(np.long))**2) for tau in range(1, tau_max)]
    return [0] + df[:]

x = np.random.randint(0, high=32000, size=2048, dtype='int16')
W = 2048
tau_max = 106

s = time.clock()
twoloop = differenceFunction_2loop(x, W, tau_max)
print(time.clock() - s)

s = time.clock()
oneloop = differenceFunction_1loop(x, W, tau_max)
print(time.clock() - s)

s = time.clock()
listcomprehension = differenceFunction_1loop_listcomp(x, W, tau_max)
print(time.clock() - s)

# confirm that the result comes out the same. 
print(np.all(twoloop == listcomprehension))
# True

Performance results (approximately):

differenceFunction_2loop() = 0.47s
differenceFunction_1loop() = 0.003s
differenceFunction_1loop_listcomp() = 0.033s

Upvotes: 1

ADR

Reputation: 1291

Instead of optimizing the algorithm, you can optimize the interpreter with numba.jit:

import timeit

import numpy as np
from numba import jit


def differenceFunction(x, W, tau_max):
    df = [0] * tau_max
    for tau in range(1, tau_max):
        for j in range(0, W - tau):
            tmp = int(x[j] - x[j + tau])
            df[tau] += tmp * tmp
    return df


@jit
def differenceFunction2(x, W, tau_max):
    df = np.zeros(tau_max)  # start from zeros; np.ndarray(shape=...) leaves uninitialized values that df[tau] += ... would accumulate
    for tau in range(1, tau_max):
        for j in range(0, W - tau):
            tmp = int(x[j] - x[j + tau])
            df[tau] += tmp * tmp
    return df


x = np.random.randint(0, high=32000, size=2048, dtype='int16')
W = 2048
tau_max = 106
differenceFunction(x, W, tau_max)


print('old',
      timeit.timeit('differenceFunction(x, W, tau_max)', 'from __main__ import differenceFunction, x, W, tau_max',
                    number=20) / 20)
print('new',
      timeit.timeit('differenceFunction2(x, W, tau_max)', 'from __main__ import differenceFunction2, x, W, tau_max',
                    number=20) / 20)

Result:

old 0.18265145074453273
new 0.016223197058214667

You can combine algorithmic optimization and numba.jit for an even better result.
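For instance, a hedged sketch of such a combination (my illustration, not ADR's code): the inner loop is vectorized as in Elliot's answer and Numba compiles the remaining outer loop.

import numpy as np
from numba import jit

@jit(nopython=True)
def differenceFunction_combined(x, W, tau_max):
    # Work in float64 so the squared differences cannot overflow int16.
    x = x.astype(np.float64)
    df = np.zeros(tau_max)
    for tau in range(1, tau_max):
        tmp = x[:W - tau] - x[tau:W]  # vectorized inner loop, as in Elliot's answer
        df[tau] = (tmp * tmp).sum()
    return df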

Upvotes: 1

Elliot

Reputation: 2690

First of all, you should consider the boundaries of the array. Your code as originally written would get an IndexError. You can get a significant speedup by vectorizing the inner loop.

import numpy as np

# original version
def differenceFunction_2loop(x, W, tau_max):
   df = np.zeros(tau_max, np.long)
   for tau in range(1, tau_max):
      for j in range(0, W - tau): # -tau eliminates the IndexError
          tmp = np.long(x[j] -x[j + tau])
          df[tau] += np.square(tmp)
   return df

# vectorized inner loop
def differenceFunction_1loop(x, W, tau_max):
    df = np.zeros(tau_max, np.long)
    for tau in range(1, tau_max):
        tmp = (x[:-tau]) - (x[tau:]).astype(np.long)
        df[tau] = np.dot(tmp, tmp)
    return df

x = np.random.randint(0, high=32000, size=2048, dtype='int16')
W = 2048
tau_max = 106
twoloop = differenceFunction_2loop(x, W, tau_max)
oneloop = differenceFunction_1loop(x, W, tau_max)

# confirm that the result comes out the same. 
print(np.all(twoloop == oneloop))
# True

Now for some benchmarking. In IPython I get the following:

In [103]: %timeit twoloop = differenceFunction_2loop(x, W, tau_max)
1 loop, best of 3: 2.35 s per loop

In [104]: %timeit oneloop = differenceFunction_1loop(x, W, tau_max)
100 loops, best of 3: 8.23 ms per loop

So, about a 300-fold speedup.

Upvotes: 5

Joonatan Samuel

Reputation: 651

I would do something like this:

>>> x = np.random.randint(0, high=32000, size=2048, dtype='int16')
>>> tau_max = 106
>>> res = np.square((x[tau_max:] - x[:-tau_max]).astype(np.int64))  # cast up so the squares don't overflow int16

However I am convinced this is not the fastest way to do it.

Upvotes: 0

Gurkamal

Reputation: 71

I don't know how you can avoid the nested loops, but for arithmetic operations you can use the NumPy library; it is faster than manual operations.

import numpy as np
tmp = np.subtract(int(x[j]), x[j + tau])

Upvotes: 0
