Jakob
Jakob

Reputation: 61

Fast multiplication of values in an array

Is there a fast way to multiply values of a float array in C++, to optimize this function (where count is a multiple of 4):

void multiply(float* values, float factor, int count)
{
    for(int i=0; i < count; i++)
    {
        *value *= factor;
        value++;
    }
}

A solution must work on Mac OS X and Windows, Intel and non-Intel. Think SSE, vectorization, compiler (gcc vs. MSVC).

Upvotes: 6

Views: 9406

Answers (7)

Bigbohne
Bigbohne

Reputation: 1356

Have you thought of OpenMP?

Most modern computers have multi-core CPUs and nearly every major compiler seems to have OpenMP built-in. You gain speed at barely any cost.

See Wikipedia's article on OpenMP.

Upvotes: 2

sellibitze
sellibitze

Reputation: 28107

I think, there is not a lot you can do that makes a big difference. Maybe you can speed it up a little with OpenMP or SSE. But Modern CPUs are quite fast already. In some applications memory bandwidth / latency is actually the bottleneck and it gets worse. We already have three levels of cache and need smart prefetch algorithms to avoid huge delays. So, it makes sense to think about memory access patterns as well. For example, if you implement such a multiply and an add and use it like this:

void multiply(float vec[], float factor, int size)
{
  for (int i=0; i<size; ++i)
    vec[i] *= factor;
}

void add(float vec[], float summand, int size)
{
  for (int i=0; i<size; ++i)
    vec[i] += summand;
}

void foo(float vec[], int size)
{
  multiply(vec,2.f,size);
  add(vec,9.f,size);
}

you're basically passing twice over the block of memory. Depending on the vector's size it might not fit into the L1 cache in which case passing over it twice adds some extra time. This is obviously bad and you should try to keep memory accesses "local". In this case, a single loop

void foo(float vec[], int size)
{
  for (int i=0; i<size; ++i) {
    vec[i] = vec[i]*2+9;
  }
}

is likely to be faster. As a rule of thumb: Try to access memory linearly and try to access memory "locally" by which I mean, try to reuse the data that is already in the L1 cache. Just an idea.

Upvotes: 0

doron
doron

Reputation: 28892

As you mentioned, there are numerous architectures out there that have SIMD extensions and SIMD is probably your best bet when it comes to optimization. They are all however platform specific and the C and C++ as languages are not SIMD friendly.

The first thing you should try however is enabling the SIMD specific flags for your given build. The compiler may recognize patterns that can be optimized with SIMD.

The next thing is to write platform specific SIMD code using compiler intrinsics or assembly where appropriate. You should however keep a portable non-SIMD implementation for platforms that do not have an optimized version. #ifdefs enable SIMD on platforms that support it.

Lastly, at least on ARM but not sure on Intel, be aware that smaller integer and floating point types allow a larger number of parallel operations per single SIMD instruction.

Upvotes: 0

rwong
rwong

Reputation: 6162

Disclaimer: obviously, this won't work on iPhone, iPad, Android, or their future equivalents.

#include <mmintrin.h>
#include <xmmintrin.h>

__m128 factor4 = _mm_set1_ps(factor);
for (int i=0; i+3 < count; i += 4)
{
   __m128 data = _mm_mul_ps(_mm_loadu_ps(values), factor4);
   _mm_storeu_ps(values, data);
   values += 4;
}
for (int i=(count/4)*4; i < count; i++)
{
   *values *= factor;
   value++;
}

Upvotes: 2

G B
G B

Reputation: 3024

The best solution is to keep it simple, and let the compiler optimize it for you. GCC knows about SSE, SSE2, altivec and what else. If your code is too complex, your compiler won't be able to optimize it on every possible target.

Upvotes: 0

st0le
st0le

Reputation: 33545

Since you know the count is a multiple of 4, you can unroll your loop...

void multiply(float* values, float factor, int count)
{
    count = count >> 2; // count / 4
    for(int i=0; i < count ; i++)
    {
        *value *= factor;
        *(value+1) *= factor;
        *(value+2) *= factor;
        *(value+3) *= factor;
        value += 4;
    }
}

Upvotes: 2

Oliver Charlesworth
Oliver Charlesworth

Reputation: 272617

If you want your code to be cross-platform, then either you're going to have to write platform-independent code, or you're going to have to write a load of #ifdefs.

Have you tried some manual loop unrolling, and seeing if it makes any difference?

Upvotes: 2

Related Questions