Samaursa
Samaursa

Reputation: 17197

Will the compiler unroll this loop?

I am creating a multi-dimensional vector (mathematical vector) where I allow basic mathematical operations +,-,/,*,=. The template takes in two parameters, one is the type (int, float etc.) while the other is the size of the vector. Currently I am applying the operations via a for loop. Now considering the size is known at compile time, will the compiler unroll the loop? If not, is there a way to unroll it with no (or minimal) performance penalty?

template <typename T, u32 size>
class Vector
{
public:
    // Various functions for mathematical operations. 
    // The functions take in a Vector<T, size>.
    // Example:
    void add(const Vector<T, size>& vec)
    {
        for (u32 i = 0; i < size; ++i)
        {
            values[i] += vec[i];
        }
    }

private:
    T   values[size];
};

Before somebody comments Profile then optimize please note that this is the basis for my 3D graphics engine and it must be fast. Second, I want to know for the sake of educating myself.

Upvotes: 10

Views: 7766

Answers (6)

decltype
decltype

Reputation: 1601

The loop can be unrolled using recursive template instantiation. This may or may not be faster on your C++ implementation.

I adjusted your example slightly, so that it would compile.

typedef unsigned u32; // or something similar

template <typename T, u32 size>
class Vector
{
  // need to use an inner class, because member templates of an 
  // unspecialized template cannot be explicitly specialized.
  template<typename Vec, u32 index>
  struct Inner
  {
    static void add(const Vec& a, const Vec& b)
    {
      a.values[index] = b.values[index];
      // triggers recursive instantiation of Inner 
      Inner<Vec, index-1>::add(a,b);
    }
  };
  // this specialization terminates the recursion
  template<typename Vec>
  struct Inner<Vec, 0>
  {
    static void add(const Vec& a, const Vec& b)
    {
      a.values[0] = b.values[0];
    }
  };

public:

    // PS! this function should probably take a 
    // _const_ Vector, because the argument is not modified
    // Various functions for mathematical operations. 
    // The functions take in a Vector<T, size>.
    // Example:
    void add(Vector<T, size>& vec)
    {
      Inner<Vector, size-1>::add(*this, vec);
    }

    T   values[size];
};

Upvotes: 3

klimkin
klimkin

Reputation: 630

You can do the following trick with disassembly to see how the particular code is compiled.

    Vector<int, 16> a, b;
    Vector<int, 65536> c, d;

    asm("xxx"); // marker
    a.Add(b);
    asm("yyy"); // marker
    c.Add(d);
    asm("zzz"); // marker

Now compile

gcc -O3 1.cc -S -o 1.s

And see the disasm

    xxx
# 0 "" 2
#NO_APP
    movdqa  524248(%rsp), %xmm0
    leaq    524248(%rsp), %rsi
    paddd   524184(%rsp), %xmm0
    movdqa  %xmm0, 524248(%rsp)
    movdqa  524264(%rsp), %xmm0
    paddd   524200(%rsp), %xmm0
    movdqa  %xmm0, 524264(%rsp)
    movdqa  524280(%rsp), %xmm0
    paddd   524216(%rsp), %xmm0
    movdqa  %xmm0, 524280(%rsp)
    movdqa  524296(%rsp), %xmm0
    paddd   524232(%rsp), %xmm0
    movdqa  %xmm0, 524296(%rsp)
#APP
# 36 "1.cc" 1
    yyy
# 0 "" 2
#NO_APP
    leaq    262040(%rsp), %rdx
    leaq    -104(%rsp), %rcx
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L2:
    movdqa  (%rcx,%rax), %xmm0
    paddd   (%rdx,%rax), %xmm0
    movdqa  %xmm0, (%rdx,%rax)
    addq    $16, %rax
    cmpq    $262144, %rax
    jne .L2
#APP
# 38 "1.cc" 1
    zzz

As you see, the first loop was small enough to get unrolled. The second is the loop.

Upvotes: 16

NPE
NPE

Reputation: 500167

First of all, it is not at all certain that unrolling the loop would be beneficial.

The only possible answer to your question is "it depends" (on the compiler flags, on the value of size, etc).

If you really want to know, ask your compiler: compile into assembly code with typical values of size and with the optimization flags you'd use for real, and examine the result.

Upvotes: 1

Seth Johnson
Seth Johnson

Reputation: 15172

The only way to figure this out is to try it on your own compiler with your own optimization parameters. Make one test file with your "does it unroll" code, test.cpp:

#include "myclass.hpp"

void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
    a.add( b );
}

then a reference code snippet reference.cpp:

#include "myclass.hpp"

void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
    a[0] += b[0];
    a[1] += b[1];
    a[2] += b[2];
}

and now use GCC to compile them and spit out only the assembly:

for x in *.cpp; do  g++ -c "$x" -Wall -Wextra -O2 -S -o "out/$x.s"; done

In my experience, GCC will unroll loops of 3 or less by default when using loops whose duration are known at compile time; using the -funroll-loops will cause it to unroll even more.

Upvotes: 1

Nemo
Nemo

Reputation: 71515

First: Modern CPUs are pretty smart about predicting branches, so unrolling the loop might not help (and could even hurt).

Second: Yes, modern compilers know how to unroll a loop like this, if it is a good idea for your target CPU.

Third: Modern compilers can even auto-vectorize the loop, which is even better than unrolling.

Bottom line: Do not think you are smarter than your compiler unless you know a lot about CPU architecture. Write your code in a simple, straightforward way, and do not worry about micro-optimizations until your profiler tells you to.

Upvotes: 7

Ben Voigt
Ben Voigt

Reputation: 283624

Many compilers will unroll this loop, no idea if "the compiler" you are referring to will. There isn't just one compiler in the world.

If you want to guarantee that it's unrolled, then TMP (with inlining) can do that. (This is actually one of the more trivial applications of TMP, often used as an example of metaprogramming).

Upvotes: 0

Related Questions