jjcasmar

Reputation: 1665

Performance differences in Eigen between auto and Eigen::Ref and concrete type

I am trying to understand how Eigen::Ref works to see if I can take some advantage of it in my code.

I have designed a benchmark like this

static void value(benchmark::State &state) {
  for (auto _ : state) {
    const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
        Eigen::Matrix<double, 9, 1>::Random();
    auto start = std::chrono::high_resolution_clock::now();

    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;

    benchmark::DoNotOptimize(v);
    auto end = std::chrono::high_resolution_clock::now();

    auto elapsed_seconds =
        std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
    state.SetIterationTime(elapsed_seconds.count());
  }
}

I have two more tests like this one: one using const Eigen::Ref<const Eigen::Vector3d> and one using auto for v0, v1, v2 and vt.

The results of these benchmarks are

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
value/manual_time               23.4 ns          113 ns     29974946
ref/manual_time                 23.0 ns          111 ns     29934053
with_auto/manual_time           23.6 ns          112 ns     29891056

As you can see, all the tests behave exactly the same. So I thought that maybe the compiler was doing its magic and decided to test with -O0. These are the results:

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
value/manual_time               2475 ns         3070 ns       291032
ref/manual_time                 2482 ns         3077 ns       289258
with_auto/manual_time           2436 ns         3012 ns       263170

Again, the three cases behave the same.

If I understand correctly, the first case, using Eigen::Vector3d, should be slower: it has to materialize the copies, perform the v0 + v1 + v2 operation and store the result, and then perform another operation and store again.

The auto case should be the fastest, as it should skip all of those writes.

The ref case, I think, should be as fast as auto: if I understand correctly, all my operations can be viewed through a reference to a const Eigen::Vector3d, so the intermediate writes should be skipped, right?

Why are the results all the same? Am I misunderstanding something, or is the benchmark just badly designed?

Upvotes: 0

Views: 513

Answers (2)

Homer512

Reputation: 13310

Because everything happens inside a function, the compiler can do escape analysis and optimize away the copies into the vectors.

To check this, I put the code in a function and looked at the assembler:

Eigen::Vector3d foo(const Eigen::VectorXd& vs)
{
    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    return vt.transpose() * vt * vt + vt;
}

which turns into this assembler

        push    rax
        mov     rax, qword ptr [rsi + 8]
...
        mov     rax, qword ptr [rsi]
        movupd  xmm0, xmmword ptr [rax]
        movsd   xmm1, qword ptr [rax + 16]
        movupd  xmm2, xmmword ptr [rax + 24]
        addpd   xmm2, xmm0
        movupd  xmm0, xmmword ptr [rax + 48]
        addsd   xmm1, qword ptr [rax + 40]
        addpd   xmm0, xmm2
        addsd   xmm1, qword ptr [rax + 64]
...
        movupd  xmmword ptr [rdi], xmm2
        addsd   xmm3, xmm1
        movsd   qword ptr [rdi + 16], xmm3
        mov     rax, rdi
        pop     rcx
        ret

Notice how the only memory operations are two GP register loads to get the start pointer and length, then a couple of memory loads to get the vector content into registers, before we write the result to memory at the end.

This only works because we are dealing with fixed-size vectors. With VectorXd, copies would definitely take place.

Alternative benchmarks

Ref is typically used on function calls. Why not try it with a function that cannot be inlined? Or come up with an example where escape analysis cannot work and the objects really have to be materialized. Something like this:

struct Foo
{
public:
    Eigen::Vector3d v0;
    Eigen::Vector3d v1;
    Eigen::Vector3d v2;
    
    Foo(const Eigen::VectorXd& vs) __attribute__((noinline));
    Eigen::Vector3d operator()() const __attribute__((noinline));
};

Foo::Foo(const Eigen::VectorXd& vs)
: v0(vs.segment<3>(0)),
  v1(vs.segment<3>(3)),
  v2(vs.segment<3>(6))
{}
Eigen::Vector3d Foo::operator()() const
{
    const Eigen::Vector3d vt = v0 + v1 + v2;
    return vt.transpose() * vt * vt + vt;
}
Eigen::Vector3d bar(const Eigen::VectorXd& vs)
{
    Foo f(vs);
    return f();
}

By splitting initialization and usage into non-inline functions, the copies really have to be done. Of course we now change the entire use case. You have to decide if this is still relevant to you.

Purpose of Ref

Ref exists for the sole purpose of providing a function interface that can accept both a full matrix/vector and a slice of one. Consider this:

Eigen::VectorXd foo(const Eigen::VectorXd&)

This interface can only accept a full vector as its input. If you want to call foo(vector.head(10)), you have to allocate a new vector to hold the segment. Likewise, it always returns a newly allocated vector, which is wasteful if you want to use it as output.head(10) = foo(input). So we can instead write

void foo(Eigen::Ref<Eigen::VectorXd> out, const Eigen::Ref<const Eigen::VectorXd>& in);

and use it as foo(output.head(10), input.head(10)) without any copies being created. This is only ever useful across compilation units. If you have one cpp file declaring a function that is used in another, Ref allows this to happen. Within a cpp file, you can simply use a template.

template<class Derived1, class Derived2>
void foo(const Eigen::MatrixBase<Derived1>& out,
         const Eigen::MatrixBase<Derived2>& in)
{
    Eigen::MatrixBase<Derived1>& mutable_out =
          const_cast<Eigen::MatrixBase<Derived1>&>(out);
    mutable_out = ...;
}

A template will always be faster because it can make use of the concrete data type. For example if you pass an entire vector, Eigen knows that the array is properly aligned. And in a full matrix it knows that there is no stride between columns. It doesn't know either of these with Ref. In this regard, Ref is just a fancy wrapper around Eigen::Map<Type, Eigen::Unaligned, Eigen::OuterStride<>>.

Likewise, there are cases where Ref has to create temporary copies. The most common one is when the inner stride is not 1. This happens, for example, if you pass a row of a matrix (but not a column; Eigen is column-major by default) or the real part of a complex-valued matrix. You will not even receive a warning for this; your code will simply run slower than expected.

The only reasons to use Ref inside a single cpp file are

  1. To make the code more readable. That template pattern shown above certainly doesn't tell you much about the expected types
  2. To reduce code size, which may have a performance benefit, but usually doesn't. It does help with compile time, though

Use with fixed-size types

Since your use case seems to involve fixed-sized vectors, let's consider this case in particular and look at the internals.

void foo(const Eigen::Vector3d&);
void bar(const Eigen::Ref<const Eigen::Vector3d>&);

int main()
{
    Eigen::VectorXd in = ...;
    foo(in.segment<3>(6));
    bar(in.segment<3>(6));
}

The following things will happen when you call foo:

  1. We copy 3 doubles from in[6] to the stack. This takes 4 instructions (2 movapd, 2 movsd).
  2. A pointer to these values is passed to foo. (Even fixed-size Eigen vectors declare a destructor, therefore they are always passed on the stack, even if we declare them pass-by-value)
  3. foo then loads the values via that pointer, taking 2 instructions (movapd + movsd)

The following happens when we call bar:

  1. We create a Ref<Vector> object. For this, we put a pointer to in.data() + 6 on the stack
  2. A pointer to this pointer is passed to bar
  3. bar loads the pointer from the stack, then loads the values

Notice how there is barely any difference. Maybe Ref saves a few instructions but it also introduces one indirection more. Compared to everything else, this is hardly significant. It is certainly too little to measure.

We are also entering the realm of micro-optimizations here. This can lead to situations where just the arrangement of code results in measurably different optimizations, as in: Eigen: Why is Map slower than Vector3d for this template expression?

Upvotes: 2

Jérôme Richard

Reputation: 50307

One big issue with the benchmark is that you measure the time inside the hot benchmarking loop. The thing is, measuring the time takes some time itself, and it can be far more expensive than the actual computation. In fact, I think this is what is happening in your case. Indeed, with Clang 13 and -O3, here is the assembly code actually benchmarked (available on GodBolt):

        mov     rbx, rax
        mov     rax, qword ptr [rsp + 24]
        cmp     rax, 2
        jle     .LBB0_17
        cmp     rax, 5
        jle     .LBB0_17
        cmp     rax, 8
        jle     .LBB0_17
        mov     rax, qword ptr [rsp + 16]
        movupd  xmm0, xmmword ptr [rax]
        movsd   xmm1, qword ptr [rax + 16]      # xmm1 = mem[0],zero
        movupd  xmm2, xmmword ptr [rax + 24]
        addpd   xmm2, xmm0
        movupd  xmm0, xmmword ptr [rax + 48]
        addsd   xmm1, qword ptr [rax + 40]
        addpd   xmm0, xmm2
        addsd   xmm1, qword ptr [rax + 64]
        movapd  xmm2, xmm0
        mulpd   xmm2, xmm0
        movapd  xmm3, xmm2
        unpckhpd        xmm3, xmm2                      # xmm3 = xmm3[1],xmm2[1]
        addsd   xmm3, xmm2
        movapd  xmm2, xmm1
        mulsd   xmm2, xmm1
        addsd   xmm2, xmm3
        movapd  xmm3, xmm1
        mulsd   xmm3, xmm2
        unpcklpd        xmm2, xmm2                      # xmm2 = xmm2[0,0]
        mulpd   xmm2, xmm0
        addpd   xmm2, xmm0
        movapd  xmmword ptr [rsp + 32], xmm2
        addsd   xmm3, xmm1
        movsd   qword ptr [rsp + 48], xmm3

This code can be executed in a few dozen cycles, so probably in less than 10-15 ns on a modern 4-5 GHz x86 processor. Meanwhile, high_resolution_clock::now() should use an RDTSC/RDTSCP instruction that also takes dozens of cycles to complete. For example, on a Skylake processor it should take about 25 cycles (similar on newer Intel processors). On an AMD Zen processor, it takes about 35-38 cycles. Additionally, it adds a synchronization that may not be representative of the actual application. Please consider measuring the time of a benchmarking loop with many iterations instead.

Upvotes: 3
