Reputation: 9
I've got an application and Ive done some speed profiling on my code and the limiting factor is the multiplication of two arrays. I've got one array X as follows (in my real code is it a 2000 elements long not 15) :
a b c d e f g h i j k l m n o
and a second array Y (in my real code it is 300 elements long not 6) as follows :
A B C D E F
I then multiply the arrays as follows to create a new array C:
value 1 A*a+B*b+C*c+D*d+E*e+F*f
value 2 A*b+B*c+C*d+D*e+E*f+F*g
value 3 A*c+B*d+C*e+D*f+E*g+F*h
value 4 A*d+B*e+C*f+D*g+E*h+F*i
value 5 A*e+B*f+C*g+D*h+E*i+F*j
value 6 A*f+B*g+C*h+D*i+E*j+F*k
value 7 A*g+B*h+C*i+D*j+E*k+F*l
value 8 A*h+B*i+C*j+D*k+E*l+F*m
value 9 A*i+B*j+C*k+D*l+E*m+F*n
value 10 A*j+B*k+C*l+D*m+E*n+F*o
I was wondering if there was a way to speed this code up. I currently just use the code:
for (int l=0; l<X.length; l++) {
val=0;
for (int i=0; i<X.length; i++)
for (int j=0; j<J.length; j++)
val+=X[i]*Y[j];
C[l]=val;
}
I though I could maybe use Fourier transforms but it is slower than multiplying it out.
Upvotes: 0
Views: 466
Reputation: 9311
I would recommend using a BLAS implementation, which is the de facto standard scientific library for performing basic vector/matrix operations. There's probably a highly optimized version of BLAS available for your architecture of choice, using SIMD instructions (if there are any) and taking care of pipeline saturation to achieve maximum throughput. For example: OS X and iOS ships with a BLAS implementation in the Accelerate.framework.
(Please also note, that your code doesn't do convolution. val only depends on i and j, so each element of C will be the same: the sum of the products of each pair from X and J. )
Upvotes: 0