How can I enhance the algebraic efficiency of my code for a custom-implemented type?
Is there maybe a virtual empty type from which I can inherit that will tell the compiler that this is fast-math stuff? Or is there a compile-time type that somebody has implemented that provides all these expression templates?
Upfront remark: Anyone making assumptions about what happens here at compile time versus run time will get them wrong, because there are two sequential passes of compilation and execution.
The first pass compiles and runs an `initial_source.cpp`. Upon execution, this program writes a new source file called `generated_program.cpp`. That generated file does not change no matter what compiler options are used. Next, `generated_program.cpp` is compiled and run. I care about the performance of this last run. As it turns out, when `initial_source.cpp` is algebraically concise, then `generated_program.cpp` is shorter and faster.
For the math nerds: any two distinct but algebraically equivalent `initial_source.cpp` files will create distinct but algebraically equivalent `generated_program.cpp` files (ignoring round-off errors).
So now what I want to tell the compiler is not: "here is `initial_source.cpp`, so please go optimize that for me." But instead: I have a custom type `Number` that can be treated like a `double` or `float`, etc.
Given any function

template<typename Tfloat>
void f(Tfloat* y, Tfloat* x){ ... }

I can call it with my custom type and from that run generate other code (think of adjoints, tangents, adjacency matrices, graphs, etc.). I like the generated code best when it is cheap to evaluate, but its length and efficiency depend dramatically on subtleties within `f`.
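For concreteness, here is a minimal sketch of what such a recording type could look like; all names and the output format are hypothetical, but it shows how running `f` with a custom type writes `generated_program.cpp`:

#include <fstream>
#include <string>

// Hypothetical recording type: every arithmetic operation appends
// one flat line of C++ to generated_program.cpp.
std::ofstream out("generated_program.cpp");
int counter = 0;

struct Number {
    std::string name;   // the variable name this value has in the generated code
    static Number fresh() { return {"t" + std::to_string(counter++)}; }
};

Number operator+(const Number& a, const Number& b) {
    Number r = Number::fresh();
    out << "double " << r.name << " = " << a.name << " + " << b.name << ";\n";
    return r;
}

Number operator-(const Number& a, const Number& b) {
    Number r = Number::fresh();
    out << "double " << r.name << " = " << a.name << " - " << b.name << ";\n";
    return r;
}

Number operator-(const Number& a) {   // unary minus
    Number r = Number::fresh();
    out << "double " << r.name << " = -" << a.name << ";\n";
    return r;
}

Calling `f<Number>` then emits exactly one generated line per operation that `f` performs, which is why the shape of `f` carries over one-to-one into `generated_program.cpp`.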
Now, if the lines within `f` read
auto z1 = (x-(-y));
auto z2 = -(x-(-y));
auto z3 = 1+x+2+3+4;
auto z4 = cos(-x);
auto z5 = -sin(-x);
then I am sure we all agree that this is inferior to
auto z1 = x+y;
auto z2 = -z1;
auto z3 = x+(1+2+3+4);
auto z4 = cos(x);
auto z5 = sin(x);
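With a recording type like the sketch above, the two versions of `z1` already differ on the tape (hypothetical output, with `x` and `y` named as in the source):

// emitted for:  auto z1 = (x-(-y));
double t0 = -y;
double t1 = x - t0;

// emitted for:  auto z1 = x+y;
double t0 = x + y;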
The first version may result in an adjoint tape that is twice as long in terms of code text, requires twice the amount of memory, and is, in consequence, ten times slower than the code generated from the second version.
Yet, naturally, the compiler will not touch any algebraic expressions and assignments that involve my `Number`, simply because it cannot even tell whether addition commutes for my type.
I know how to implement expression templates, but doing so would mean an incredible amount of work. Given that the compiler is so excellent at doing exactly this kind of work for standard types, it would be incredibly stupid to spend my time reproducing an inferior version of it, all the more so because the algebra is just a tiny detail of my actual project task.
I am aware that "the optimal" algebraic version of any program depends on the implementation details of my `Number`, but it is perfectly fine for us to assume that the relative cost per operation is the same as for `double` (i.e., add is way cheaper than mul unless FMA is available, and no sign switch is cheaper than a sign switch). And if that raises too many issues, the spirit is: I just need something that is better than what I have and does not reinvent the wheel (i.e., I won't implement expression templates for sign switches and the ordering of sums and products myself).
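To illustrate what "expression templates for sign-switch" would mean, here is a minimal sketch (all names hypothetical) of a single rewrite rule that cancels a double negation at compile time; multiply this effort by every algebraic identity of interest and the workload becomes clear:

#include <type_traits>
#include <utility>

struct Expr {};                     // tag base for expression nodes

struct Leaf : Expr { double v; };   // hypothetical leaf node

template<class E> struct Neg : Expr { E e; };

// generic rule: negating an expression node wraps it in a Neg node
template<class E, class = std::enable_if_t<std::is_base_of_v<Expr, E>>>
Neg<E> operator-(E e) { return {{}, std::move(e)}; }

// rewrite rule: -(-e) collapses back to e, resolved at compile time
template<class E>
E operator-(Neg<E> n) { return std::move(n.e); }

static_assert(std::is_same_v<decltype(-(-Leaf{1.0})), Leaf>);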
Since many algebraic objects do not commute (e.g., matrices under multiplication), the compiler will not optimize the expression for `z3`, even if `x` is as heavy as a matrix and, as per the operator overload, we just want to add a constant to each term. Likewise, the compiler will not replace `x-(-y)` with `x+y`, because nothing stops a developer from implementing the unary minus operator so that it is not its own inverse.
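Indeed, the following deliberately pathological (and hypothetical) type makes `-(-x)` observably different from `x`, so a conforming compiler must preserve every sign switch:

struct Weird {
    double v;
    int flips = 0;                // observable state
    Weird operator-() const {     // not an involution: it counts its applications
        return Weird{-v, flips + 1};
    }
};
// For a Weird x, (-(-x)).flips == x.flips + 2 differs from x.flips,
// so rewriting x-(-y) into x+y would change observable behavior.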
If a compiler were all-knowing, the generated code (which is flat, i.e., all types are `double`) could in principle be optimized to the same level of mastery as the code generated from an algebraically optimized initial code, because, after all, the two generated codes are likewise algebraically equivalent (before round-off, that is). Yet real-world compilers produce much slower code from the generated code of an algebraically poor initial version.
Some readers believe the compiler will ultimately optimize both generated codes to the same level. To break this down, let's start from the initial code. Suppose a `Number` that just prints a different string to the monitor for every operation taking place. Naturally, the compiler, regardless of the optimization options, must not change that monitor output. Consequently, the tapes, text files, and generated codes created from a run of the initial code with the custom type `Number` will be the same regardless of which optimizations are used. That highlights the issue with `z3`.
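A minimal sketch of such a printing type (hypothetical, standard C++ only):

#include <iostream>

struct Number {
    double v;
    // every operation has an observable side effect, so no optimization
    // level is allowed to add, drop, or reorder these operations
    Number operator+(const Number& rhs) const {
        std::cout << "add\n";
        return {v + rhs.v};
    }
    Number operator-() const {
        std::cout << "neg\n";
        return {-v};
    }
};

Replace "print a line" by "append an entry to the tape" and the same conclusion carries over: the tape records exactly the operations as written in the initial code.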
Next, let's look at adjoints. For adjoints, each operation stores at least one number on the tape, with a derivative computation taking place later in the reverse order of the tape. It is not unrealistic to picture generated code that is several hundred thousand lines long. I don't care how awesome your compiler is: it will not identify that two operations, whose tape elements are 100k lines apart in an entirely nonlinear piece of incomprehensible auto-generated code, can be skipped due to algebraic equivalence. That is just not what compilers do; they analyze code locally. Again, in theory all generated assembly code is equivalent and under perfect optimization would be identical (in case the optimal solution is unique), but I am interested in a real-world solution.
Suppose
void f(Tfloat& y, Tfloat& x){ y = 1+x+2+3+4; }
This will create the following adjoint C++ code file:
void gf(Tfloat& y, Tfloat& x, Tfloat& gy, Tfloat& gx){
    // forward sweep: one tape entry per operation
    auto tmp1 = 1+x;
    auto tmp2 = tmp1+2;
    auto tmp3 = tmp2+3;
    y = tmp3+4;
    // reverse sweep: propagate the adjoints
    auto gtmp3 = gy;
    auto gtmp2 = gtmp3;
    auto gtmp1 = gtmp2;
    gx += gtmp1;
}
In contrast, an algebraically optimized initial version of the code would yield the following distinct tape:
void gf(Tfloat& y, Tfloat& x, Tfloat& gy, Tfloat& gx){
    y = x+(1+2+3+4);
    gx += gy;
}
Admittedly, given both generated pieces of code, a perfect compiler could produce identically optimized machine code from either of them. But in the real world, and for much, much bigger files, this just is not the case.