Reputation: 17197
I am creating a multi-dimensional vector (mathematical vector) where I allow basic mathematical operations +,-,/,*,=. The template takes in two parameters, one is the type (int, float etc.) while the other is the size of the vector. Currently I am applying the operations via a for loop. Now considering the size is known at compile time, will the compiler unroll the loop? If not, is there a way to unroll it with no (or minimal) performance penalty?
template <typename T, u32 size>
class Vector
{
public:
// Various functions for mathematical operations.
// The functions take in a Vector<T, size>.
// Example:
void add(const Vector<T, size>& vec)
{
for (u32 i = 0; i < size; ++i)
{
values[i] += vec[i];
}
}
private:
T values[size];
};
Before somebody comments Profile then optimize
please note that this is the basis for my 3D graphics engine and it must be fast. Second, I want to know for the sake of educating myself.
Upvotes: 10
Views: 7766
Reputation: 1601
The loop can be unrolled using recursive template instantiation. This may or may not be faster on your C++ implementation.
I adjusted your example slightly, so that it would compile.
typedef unsigned u32; // or something similar
template <typename T, u32 size>
class Vector
{
// need to use an inner class, because member templates of an
// unspecialized template cannot be explicitly specialized.
template<typename Vec, u32 index>
struct Inner
{
static void add(const Vec& a, const Vec& b)
{
a.values[index] = b.values[index];
// triggers recursive instantiation of Inner
Inner<Vec, index-1>::add(a,b);
}
};
// this specialization terminates the recursion
template<typename Vec>
struct Inner<Vec, 0>
{
static void add(const Vec& a, const Vec& b)
{
a.values[0] = b.values[0];
}
};
public:
// PS! this function should probably take a
// _const_ Vector, because the argument is not modified
// Various functions for mathematical operations.
// The functions take in a Vector<T, size>.
// Example:
void add(Vector<T, size>& vec)
{
Inner<Vector, size-1>::add(*this, vec);
}
T values[size];
};
Upvotes: 3
Reputation: 630
You can do the following trick with disassembly to see how the particular code is compiled.
Vector<int, 16> a, b;
Vector<int, 65536> c, d;
asm("xxx"); // marker
a.Add(b);
asm("yyy"); // marker
c.Add(d);
asm("zzz"); // marker
Now compile
gcc -O3 1.cc -S -o 1.s
And see the disasm
xxx
# 0 "" 2
#NO_APP
movdqa 524248(%rsp), %xmm0
leaq 524248(%rsp), %rsi
paddd 524184(%rsp), %xmm0
movdqa %xmm0, 524248(%rsp)
movdqa 524264(%rsp), %xmm0
paddd 524200(%rsp), %xmm0
movdqa %xmm0, 524264(%rsp)
movdqa 524280(%rsp), %xmm0
paddd 524216(%rsp), %xmm0
movdqa %xmm0, 524280(%rsp)
movdqa 524296(%rsp), %xmm0
paddd 524232(%rsp), %xmm0
movdqa %xmm0, 524296(%rsp)
#APP
# 36 "1.cc" 1
yyy
# 0 "" 2
#NO_APP
leaq 262040(%rsp), %rdx
leaq -104(%rsp), %rcx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L2:
movdqa (%rcx,%rax), %xmm0
paddd (%rdx,%rax), %xmm0
movdqa %xmm0, (%rdx,%rax)
addq $16, %rax
cmpq $262144, %rax
jne .L2
#APP
# 38 "1.cc" 1
zzz
As you see, the first loop was small enough to get unrolled. The second is the loop.
Upvotes: 16
Reputation: 500167
First of all, it is not at all certain that unrolling the loop would be beneficial.
The only possible answer to your question is "it depends" (on the compiler flags, on the value of size
, etc).
If you really want to know, ask your compiler: compile into assembly code with typical values of size
and with the optimization flags you'd use for real, and examine the result.
Upvotes: 1
Reputation: 15172
The only way to figure this out is to try it on your own compiler with your own optimization parameters. Make one test file with your "does it unroll" code, test.cpp
:
#include "myclass.hpp"
void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
a.add( b );
}
then a reference code snippet reference.cpp
:
#include "myclass.hpp"
void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
a[0] += b[0];
a[1] += b[1];
a[2] += b[2];
}
and now use GCC to compile them and spit out only the assembly:
for x in *.cpp; do g++ -c "$x" -Wall -Wextra -O2 -S -o "out/$x.s"; done
In my experience, GCC will unroll loops of 3 or less by default when using loops whose duration are known at compile time; using the -funroll-loops
will cause it to unroll even more.
Upvotes: 1
Reputation: 71515
First: Modern CPUs are pretty smart about predicting branches, so unrolling the loop might not help (and could even hurt).
Second: Yes, modern compilers know how to unroll a loop like this, if it is a good idea for your target CPU.
Third: Modern compilers can even auto-vectorize the loop, which is even better than unrolling.
Bottom line: Do not think you are smarter than your compiler unless you know a lot about CPU architecture. Write your code in a simple, straightforward way, and do not worry about micro-optimizations until your profiler tells you to.
Upvotes: 7
Reputation: 283624
Many compilers will unroll this loop, no idea if "the compiler" you are referring to will. There isn't just one compiler in the world.
If you want to guarantee that it's unrolled, then TMP (with inlining) can do that. (This is actually one of the more trivial applications of TMP, often used as an example of metaprogramming).
Upvotes: 0