Reputation: 81

ARM NEON simple low pass filter vectorization

I have a simple single pole low pass filter (for parameter smoothing) that can be explained by the following formula:

y[n] = (1-a) * y[n-1] + a * x[n]

How can I vectorize this efficiently on ARM NEON using intrinsics? Is it even possible? The problem is that every output depends on the previous result.

Upvotes: 4

Answers (4)

Jason R

Reputation: 11758

Assuming that you perform vector operations M elements at a time (I think NEON is 128 bits wide, so that would be M=4 32-bit elements), you can unroll the difference equation by a factor of M pretty easily for the simple single-pole filter. Assume that you have already calculated all outputs up to y[n]. Then, you can calculate the next four as follows:

y[n+1] = (1-a)*y[n] + a*x[n+1]
y[n+2] = (1-a)*y[n+1] + a*x[n+2] = (1-a)*((1-a)*y[n] + a*x[n+1]) + a*x[n+2]
       = (1-a)^2*y[n] + a*(1-a)*x[n+1] + a*x[n+2]
...

In general, you can write y[n+k] as:

y[n+k] = (1-a)^k*y[n] + sum_{i=1}^k a*(1-a)^{k-i}*x[n+i]

I know the above is difficult to read (maybe we can migrate this question over to Signal Processing and I can re-typeset in LaTeX). But, given an initial condition y[n] (which is assumed to be the last output calculated on the previous vectorized iteration), you can calculate the next M outputs in parallel, as the rest of the unrolled filter has an FIR-like structure.
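As a rough illustration, the unrolled recurrence for M=4 could be sketched as below. The function name `lowpass_block4`, its signature, and the requirement that `n` be a multiple of 4 are assumptions made for this example, not anything from the question. The NEON path is only compiled when the compiler defines `__ARM_NEON`; otherwise a plain C fallback computes the same four lanes so the sketch can be checked on any machine.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: filters n samples (n assumed to be a multiple of 4)
 * in blocks of 4. y_prev is the last output from the previous block; the
 * last output of this call is returned so it can seed the next call. */
float lowpass_block4(const float *x, float *y, size_t n, float a, float y_prev)
{
    const float b  = 1.0f - a;
    const float b2 = b * b, b3 = b2 * b, b4 = b2 * b2;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const float pw[4] = { b, b2, b3, b4 };
    const float32x4_t powers = vld1q_f32(pw);       /* (1-a)^k, k = 1..4 */
    const float32x4_t zero   = vdupq_n_f32(0.0f);
    for (size_t i = 0; i < n; i += 4) {
        /* u = a * x[n+1..n+4] */
        float32x4_t u   = vmulq_n_f32(vld1q_f32(x + i), a);
        /* u + (1-a)^k * y[n] */
        float32x4_t acc = vmlaq_n_f32(u, powers, y_prev);
        /* FIR-like part: add delayed copies of u, zeros shifted in */
        acc = vmlaq_n_f32(acc, vextq_f32(zero, u, 3), b);   /* [0,u1,u2,u3] */
        acc = vmlaq_n_f32(acc, vextq_f32(zero, u, 2), b2);  /* [0,0,u1,u2]  */
        acc = vmlaq_n_f32(acc, vextq_f32(zero, u, 1), b3);  /* [0,0,0,u1]   */
        vst1q_f32(y + i, acc);
        y_prev = vgetq_lane_f32(acc, 3);            /* carry y[n+4] forward */
    }
#else
    /* Portable fallback: the same four expressions written out per lane. */
    for (size_t i = 0; i < n; i += 4) {
        const float u1 = a * x[i],     u2 = a * x[i + 1];
        const float u3 = a * x[i + 2], u4 = a * x[i + 3];
        y[i]     = b  * y_prev + u1;
        y[i + 1] = b2 * y_prev + b  * u1 + u2;
        y[i + 2] = b3 * y_prev + b2 * u1 + b  * u2 + u3;
        y[i + 3] = b4 * y_prev + b3 * u1 + b2 * u2 + b * u3 + u4;
        y_prev = y[i + 3];
    }
#endif
    return y_prev;
}
```

The only value carried from one block to the next is the last lane, so the serial dependency is paid once per four samples instead of once per sample.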

There are some caveats to this approach: if M becomes large, then you end up multiplying a bunch of numbers together in order to get the effective FIR coefficients for the unrolled filters. Depending upon your number format and the value of a, this could have numerical precision implications. Also, you don't get an M-fold speedup with this approach: you end up calculating y[n+k] with what amounts to a k-tap FIR filter. Although you're calculating M outputs in parallel, the fact that you have to do k multiply-accumulate operations instead of the simple first-order recursive implementation diminishes some of the benefit to vectorization.

Upvotes: 3

Tobby

Reputation: 1

How about expanding the equations to 4 steps and using matrix multiplication? a is constant, so the matrix can be precalculated.
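Spelled out for 4 steps, and writing b = 1-a as a shorthand (my notation, not the question's), the matrix form would presumably look like this:

```latex
\begin{bmatrix} y[n+1] \\ y[n+2] \\ y[n+3] \\ y[n+4] \end{bmatrix}
=
\begin{bmatrix} b \\ b^2 \\ b^3 \\ b^4 \end{bmatrix} y[n]
+ a
\begin{bmatrix}
1   & 0   & 0 & 0 \\
b   & 1   & 0 & 0 \\
b^2 & b   & 1 & 0 \\
b^3 & b^2 & b & 1
\end{bmatrix}
\begin{bmatrix} x[n+1] \\ x[n+2] \\ x[n+3] \\ x[n+4] \end{bmatrix},
\qquad b = 1 - a
```

Both the column of powers and the lower-triangular matrix depend only on a, so they only need to be recomputed when a changes.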

Upvotes: 0

hotpaw2

Reputation: 70733

In general, you can only vectorize completely independent sets of computations. But in your IIR low pass, every output (except the first) depends on the previous one, so vectorization is not possible.

If your variable "a" is large enough that (1-a)^n quickly decays to below your desired noise floor or allowed error, you could substitute a short FIR filter approximation for your IIR, and vectorize that convolution instead. But that's not likely to be faster.
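A minimal sketch of that FIR substitution, assuming the impulse response h[k] = a*(1-a)^k is truncated once (1-a)^k drops below a tolerance. The helper names `lowpass_fir_taps` and `fir_apply`, the tolerance parameter, and the zero initial state are assumptions for illustration only.

```c
#include <stddef.h>

/* Hypothetical helper: fills h[] with h[k] = a*(1-a)^k, stopping once
 * (1-a)^k falls below tol or max_taps is reached. Returns the tap count. */
size_t lowpass_fir_taps(float a, float tol, float *h, size_t max_taps)
{
    float p = 1.0f;                 /* running value of (1-a)^k */
    size_t k = 0;
    while (k < max_taps && p >= tol) {
        h[k++] = a * p;
        p *= 1.0f - a;
    }
    return k;
}

/* Plain convolution y[i] = sum_k h[k]*x[i-k]; samples before x[0] are
 * treated as zero. Each output depends only on inputs, never on earlier
 * outputs, so the loops are free of the recursive dependency. */
void fir_apply(const float *x, float *y, size_t n, const float *h, size_t taps)
{
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps && k <= i; ++k)
            acc += h[k] * x[i - k];
        y[i] = acc;
    }
}
```

The tap count grows quickly as a gets small, which is where the extra multiply-accumulates can cancel out whatever vectorization gains.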

Upvotes: 0

Paul R

Reputation: 213130

You can only really vectorize this if you have more than one signal to which you wish to apply the same filter, e.g. if it's a stereo audio signal then you can process the left and right channel in parallel. Four or eight channels in parallel would obviously be even better.
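For example, with four channels stored interleaved (frame 0 of channels 0..3, then frame 1, and so on), something along these lines should work. The function name `lowpass_4ch`, the interleaved layout, and the `state` array are assumptions for this sketch; the NEON path is guarded by `__ARM_NEON`, with a scalar fallback that does the same arithmetic.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: x and y each hold `frames` frames of 4 interleaved
 * channels. state[4] holds y[n-1] for each channel and is updated in place
 * so the filter can be continued across calls. */
void lowpass_4ch(const float *x, float *y, size_t frames, float a, float state[4])
{
    const float b = 1.0f - a;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t s = vld1q_f32(state);               /* one lane per channel */
    for (size_t i = 0; i < frames; ++i) {
        float32x4_t in = vld1q_f32(x + 4 * i);
        /* (1-a)*y[n-1] + a*x[n], all four channels at once */
        s = vmlaq_n_f32(vmulq_n_f32(s, b), in, a);
        vst1q_f32(y + 4 * i, s);
    }
    vst1q_f32(state, s);
#else
    /* Portable fallback: same update, one channel at a time. */
    for (size_t i = 0; i < frames; ++i) {
        for (size_t c = 0; c < 4; ++c) {
            state[c] = b * state[c] + a * x[4 * i + c];
            y[4 * i + c] = state[c];
        }
    }
#endif
}
```

Here the recursion still runs sample by sample, but each step updates four independent filters, so no unrolling of the recurrence is needed.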

Upvotes: 0
