user13492
user13492

Reputation: 87

C++ numeric code (using Eigen) surprisingly slow

I'm a physicist used to writing code in MATLAB / Python / Julia, considering using C++ for some performance sensitive code. I've written one of the most performance sensitive functions currently of interest to me - the computation of the Rotne-Prager tensor - in the form of a concise benchmark and implemented it in C++, Julia and MATLAB. The C++ version is currently slower than the Julia code by a factor of 8, leading me to conclude that I've written some truly horrible C++. The code reads as follows:

#include "stdafx.h"
#include <iostream>
#include <random>
#include <chrono>
#include <Eigen/Dense>

using namespace Eigen;
using namespace std;
using namespace std::chrono;

Matrix3d outer_product_3d(Vector3d a, Vector3d b)
{
    Matrix3d m;
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++)
        {
            m(i, j) = a(i) * b(j);
        }
    }
    return m;
}

MatrixXd rotne_prager(Vector3d coords[], double r, const int p)
{
    MatrixXd rpy = MatrixXd::Zero(3*(p+1),3*(p+1));
    Vector3d rvec;
    double Rij;
    double distance_ratio;
    Matrix3d I = Matrix3d::Identity(3, 3);
    for (int i = 0; i < p + 1; i++)
    {
        for (int j = 0; j < p + 1; j++)
        {
            rvec(0) = coords[j](0) - coords[i](0);
            rvec(1) = coords[j](1) - coords[i](1);
            rvec(2) = coords[j](2) - coords[i](2);
            Rij = sqrt(rvec(0)*rvec(0) + rvec(1)*rvec(1) + rvec(2)*rvec(2));
            distance_ratio = r / Rij;
            if (Rij > 2 * r)
            {
                rpy.block(3 * (i + 1) - 3, 3 * (j + 1) - 3, 3, 3) =
                    0.75 * distance_ratio * ((1.0 + 2.0 / 3.0 * distance_ratio * distance_ratio) * I
                    + (1.0 - 2.0*distance_ratio * distance_ratio) * outer_product_3d(rvec, rvec));
            }
            else
            {
                rpy.block(3 * (i + 1) - 3, 3 * (j + 1) - 3, 3, 3) = 
                    (1.0 - 9.0 / 32.0 / distance_ratio) * I +
                    3.0 / 32.0 / distance_ratio * outer_product_3d(rvec, rvec);
            }
        }
    }
    return rpy;
}

int main()
{
    random_device rd;
    mt19937 gen(rd());
    uniform_real_distribution<> dis(0, 1);

    auto start = high_resolution_clock::now();

    const int p = 1000;
    Vector3d coords[p + 1];
    for (int i = 0; i < p + 1; i++)
    {
        coords[i](0) = dis(gen); coords[i](1) = dis(gen); coords[i](2) = dis(gen);
    }
    MatrixXd rpy = rotne_prager(coords, 0.01, p);
    auto end = high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    cout << "p = " << p << ", time elapsed: " << duration << " ms" << endl;
}

The Visual Studio profiling tools tell me roughly a third of the cpu time is spent in the outer_product_3d function (which is fairly reasonable, I think), the rest is in rotne_prager. I don't see any glaringly misperforming lines in the profiler, rather the entire thing just takes more time overall, although the result is in fact correct.

I haven't written any C++ in a long, long time (and none of what I've written in my day has really been performance sensitive), so I'm sure I've done a lot of things wrong. However, I've gone through the Eigen documentation and haven't discovered any real mistakes in the process, so I'd be grateful if anyone could point out the areas in my code that could be improved from a performance point of view. I can also provide the Julia code if anyone's interested in that (it doesn't really contain any special performance tricks, apart from the basic optimizations of @inbounds and the like).

Upvotes: 0

Views: 566

Answers (1)

ggael
ggael

Reputation: 29205

First of all, make sure you benchmarked with compiler optimizations ON, aka Release mode in visual studio.

Second, what's the purpose of the outer_product_3d function, why not doing rvec*rvec.transpose()?

Finally, you can significantly reduce the operation count as follows:

rpy.block<3,3>(3 * (i + 1) - 3, 3 * (j + 1) - 3) = ((1.0 - 2.0*distance_ratio * distance_ratio) * rvec) * rvec.transpose();
rpy.block<3,3>(3 * (i + 1) - 3, 3 * (j + 1) - 3).diagonal().array() += 0.75 * distance_ratio * ((1.0 + 2.0 / 3.0 * distance_ratio * distance_ratio);

In the first line, you avoid applying the scaling to the 3x3 matrix by applying it to a 3x1 vector only. The second line save 6 additions. The same applies to the else branch.

Upvotes: 2

Related Questions