Reputation: 73
The following is IIR filter code. I need to vectorize it so that I can write efficient NEON code.
Example of vectorization

Non-vectorized code:
for(i=0;i<100;i++)
    a[i] = a[i]*b[i];   // only one independent multiplication; cannot take
                        // advantage of multiple multiplication units
Vectorized code:
for(i=0;i<25;i++)
{
    a[i*4]   = a[i*4]  *b[i*4];     // four independent multiplications can use
    a[i*4+1] = a[i*4+1]*b[i*4+1];   // multiple multiplication units to perform the
    a[i*4+2] = a[i*4+2]*b[i*4+2];   // operation in parallel
    a[i*4+3] = a[i*4+3]*b[i*4+3];
}
Please help me vectorize the for loop below so that it can be implemented efficiently using the vector capability of the hardware (my hardware can perform 4 multiplications simultaneously).
main()
{
    for(j=0;j<NUMBQUAD;j++)
    {
        for(i=2;i<SAMPLES+2;i++)
        {
            w[i]   = x[i-2] + a1[j]*w[i-1] + a2[j]*w[i-2];
            y[i-2] = w[i]   + b1[j]*w[i-1] + b2[j]*w[i-2];
        }
        w[0] = 0;
        w[1] = 0;
    }
}
Upvotes: 2
Views: 437
Reputation: 20037
Once you have fixed (or verified) the equations, you should notice that there are 4 independent multiplications in each round of the equation. The task then becomes finding a correct sequence, with the fewest instructions, that permutes the input vectors x[...], y[...], w[...] into registers such as:
q0 = | w[i-1] | w[i-2] | w[i-1] | w[i-2]|
q1 = | a1[j] | a2[j] | b1[j] | b2[j] | // vld1.32 {d0,d1}, [r1]!
q2 = q0 .* q1
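To make the lane layout above concrete, here is a minimal sketch in plain C rather than NEON, so it runs anywhere. The `vec4` type and the names `vec4_mul` and `biquad_lanes` are made up for illustration; `vec4` only stands in for a q register, and the loop over lanes is what a single vector multiply would do. It assumes the equations from the question as written, with the state starting at zero.

```c
#include <stddef.h>

/* Hypothetical stand-in for a 4-lane float register (a NEON q register). */
typedef struct { float lane[4]; } vec4;

/* Lane-wise multiply: the scalar equivalent of one 4-wide vector multiply. */
static vec4 vec4_mul(vec4 p, vec4 q)
{
    vec4 r;
    for (int k = 0; k < 4; k++)
        r.lane[k] = p.lane[k] * q.lane[k];
    return r;
}

/* Run one biquad section over n samples.
 * coef holds the lanes in the order a1, a2, b1, b2 (as in q1 above). */
void biquad_lanes(const float *x, float *y, size_t n, vec4 coef)
{
    float w1 = 0.0f, w2 = 0.0f;              /* w[i-1] and w[i-2] */
    for (size_t i = 0; i < n; i++) {
        vec4 q0 = {{ w1, w2, w1, w2 }};      /* permuted state, as in q0 above */
        vec4 q2 = vec4_mul(q0, coef);        /* four independent products */
        float w0 = x[i] + q2.lane[0] + q2.lane[1];
        y[i] = w0 + q2.lane[2] + q2.lane[3];
        w2 = w1;                             /* shift the state */
        w1 = w0;
    }
}
```

Note that only the four multiplications run in parallel here; the additions that follow are still a serial chain from one sample to the next, which is why this layout alone gives a limited speedup.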
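A potentially much more effective approach is wavefront parallelism, which can be achieved by interchanging the for loops, so that the loop over samples is outermost and the biquad sections are processed inside it, one sample apart: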
x0 = *x++;
w0 = x0 + a*w1 + b*w2; // pipeline warming stage
y0 = w0 + c*w1 + d*w2; //
[REPEAT THIS]
// W2 = W1; W1 = W0;
W0 = y0 + A*W1 + B*W2;
Y0 = W0 + C*W1 + D*W2;
// w2 = w1; w1 = w0;
x0 = *x++;
*output++= Y0;
w0 = x0 + a*w1 + b*w2;
y0 = w0 + c*w1 + d*w2;
[REPEAT ENDS]
W0 = y0 + A*W1 + B*W2; // pipeline cooling stage
Y0 = W0 + C*W1 + D*W2;
*output++= Y0;
While there are still dependencies along the chain x0 -> w0 -> y0 -> W0 -> Y0, there is an opportunity for full 2-way parallelism between the lower-case and upper-case expressions. One can also try to get rid of shifting the values (w2=w1; w1=w0;) by unrolling the loop and doing manual register renaming.
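As a sanity check of the pipelined schedule, here is a rough sketch in plain C for exactly two cascaded sections, next to a straightforward section-by-section version it can be compared against. The `biquad` struct and the function names are hypothetical, and the sketch assumes the second section takes the first section's output as its input (as in the pseudocode above, where W0 is computed from y0). It keeps the state shifts instead of unrolling, for readability.

```c
#include <stddef.h>

/* Hypothetical coefficient set for one section. */
typedef struct { float a1, a2, b1, b2; } biquad;

/* One section over the whole buffer; in and out may be the same array. */
static void biquad_pass(const float *in, float *out, size_t n, biquad s)
{
    float w1 = 0.0f, w2 = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float w0 = in[i] + s.a1 * w1 + s.a2 * w2;
        out[i] = w0 + s.b1 * w1 + s.b2 * w2;
        w2 = w1; w1 = w0;
    }
}

/* Reference: section 0 over all samples, then section 1 over the result. */
void cascade_reference(const float *x, float *out, size_t n,
                       biquad s0, biquad s1)
{
    biquad_pass(x, out, n, s0);
    biquad_pass(out, out, n, s1);
}

/* Pipelined: section 1 works on sample i-1 while section 0 works on sample i. */
void cascade_pipelined(const float *x, float *out, size_t n,
                       biquad s0, biquad s1)
{
    if (n == 0) return;
    float w1 = 0.0f, w2 = 0.0f, W1 = 0.0f, W2 = 0.0f;

    /* pipeline warming stage: section 0 on the first sample */
    float w0 = x[0] + s0.a1 * w1 + s0.a2 * w2;
    float y0 = w0 + s0.b1 * w1 + s0.b2 * w2;
    w2 = w1; w1 = w0;

    for (size_t i = 1; i < n; i++) {
        /* section 1 on sample i-1 (uses y0 from the previous round) */
        float W0 = y0 + s1.a1 * W1 + s1.a2 * W2;
        float Y0 = W0 + s1.b1 * W1 + s1.b2 * W2;
        W2 = W1; W1 = W0;
        out[i - 1] = Y0;

        /* section 0 on sample i (touches none of W0, W1, W2, Y0) */
        w0 = x[i] + s0.a1 * w1 + s0.a2 * w2;
        y0 = w0 + s0.b1 * w1 + s0.b2 * w2;
        w2 = w1; w1 = w0;
    }

    /* pipeline cooling stage: section 1 on the last sample */
    float W0 = y0 + s1.a1 * W1 + s1.a2 * W2;
    out[n - 1] = W0 + s1.b1 * W1 + s1.b2 * W2;
}
```

Inside the loop the two groups only share y0, which section 1 reads before section 0 overwrites it, so on real hardware the two sets of multiplies could be paired into the same vector instructions. Both functions should produce the same output for the same input, which makes the reference handy for testing a NEON version.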
Upvotes: 2