Reputation: 1155
I have a tight loop exactly like the one Chandler Carruth presented at CppCon 2017: https://www.youtube.com/watch?v=2EWejmkKlxs. At about 25 minutes into the video, there is a loop like this:
for (int& i : v)
    i = i > 255 ? 255 : i;
where v is a vector. This is exactly the code used in my program, which profiling shows takes a significant amount of time.
In his presentation, Chandler modified the assembly and sped up the loop. My question is: in practice, in production code, what is the recommended approach to optimising this? Should we use inline assembly in the C++ code? Or, like Chandler did, compile the C++ code to assembly and then optimise the assembly by hand?
An example of how to optimise the above for loop would be really appreciated, assuming an x86 architecture.
Upvotes: 1
Views: 324
Reputation: 364059
Chandler modified the compiler's asm output because that's an easy way to do a one-off experiment and find out whether a change would be useful, without doing all the work you'd normally need to maintain an asm loop or function as part of a project's source code.
Compiler-generated asm is usually a good starting point for an optimized loop, but keeping the whole file as-is isn't a good (or even viable) way to maintain an asm implementation of a loop as part of a program. See @Aconcagua's answer.
Plus it defeats the purpose of having any other functions in the file written in C++ and being available for link-time optimization.
Re: actually clamping:
Note that Chandler was just experimenting with changes to the non-vectorized code-gen, and had disabled unrolling + auto-vectorization. In real life, hopefully you can target SSE4.1 or AVX2 and let the compiler auto-vectorize with pminsd or pminud for signed or unsigned int clamping to an upper bound. (These are also available in other element sizes. Or without SSE4.1, with just SSE2, maybe you can do 2x packssdw => packuswb (unsigned saturation), then unpack with zeros back up to 4 vectors of dword elements. (If you can't just use an output of uint8_t[]!)
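To illustrate, here's roughly what the auto-vectorized inner loop does, written out with SSE4.1 intrinsics (_mm_min_epi32 is the intrinsic for pminsd). This is just a sketch of what the compiler would generate for you; the function name and the GCC/Clang target attribute are my own additions:

```cpp
#include <immintrin.h>   // SSE intrinsics
#include <cassert>
#include <cstddef>
#include <vector>

// Clamp each element of v to at most 255, 4 ints per vector iteration.
// _mm_min_epi32 compiles to pminsd (signed 32-bit packed min).
__attribute__((target("sse4.1")))   // GCC/Clang: allow SSE4.1 in this function
void clamp255_sse41(std::vector<int>& v) {
    const __m128i limit = _mm_set1_epi32(255);
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&v[i]));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&v[i]),
                         _mm_min_epi32(x, limit));
    }
    for (; i < v.size(); ++i)        // scalar tail for leftover elements
        v[i] = v[i] > 255 ? 255 : v[i];
}
```

The point is that this is what you get for free from auto-vectorization of the plain C++ loop, without any hand-written asm to maintain.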
And BTW, in the comments of the video, Chandler said it turns out that he made a mistake, and the effect he was seeing wasn't really due to a predictable branch vs. a cmov. It might have been a code-alignment thing, because changing from mov %ebx, (%rdi) to movl $255, (%rdi) made a difference!
(AMD CPUs aren't known to have register-read stalls the way P6-family did, so they should have no trouble hiding the dependency chain of a cmov coupling a store to a load, vs. breaking it with branch prediction + speculation past the branch.)
You would very rarely actually want to use a hand-written asm loop. Often you can hand-hold and/or trick your compiler into making asm more like what you want just by modifying the C++ source. Then a future compiler is free to tune differently for -march=some_future_cpu.
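For example, expressing the clamp with std::min keeps the source fully portable and gives current compilers everything they need to auto-vectorize at -O3 (the function name here is my own; it's just the question's loop restated):

```cpp
#include <algorithm>   // std::min
#include <cassert>
#include <vector>

// Same clamp as the original loop, written so the compiler is free to
// auto-vectorize it (e.g. with pminsd) and to re-tune for future -march targets.
void clamp255(std::vector<int>& v) {
    for (int& i : v)
        i = std::min(i, 255);
}
```

You can check the generated asm on a compiler explorer to confirm the vectorization actually happened for your target.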
Upvotes: 2
Reputation: 25516
My question is: in practice, in production code, what is the recommended approach to optimising this? Should we use inline assembly in the C++ code? Or, like Chandler did, compile the C++ code to assembly and then optimise the assembly by hand?
For production code you need to consider that the software might be compiled and linked in an automated build system.
How would you apply changes to assembler code in such a system? You might apply a diff file, but that might break if optimisation (or other) settings are changed, if you switch to another compiler, or ...
That leaves two options: write the entire function in an assembler file (.s), or use inline assembler inside the C++ code – the latter possibly with the advantage of keeping related code in the same translation unit.
Still, I'd let the compiler generate the assembler code once, with the highest optimisation level available. That code can then serve as an (already pre-optimised) base for your hand-made optimisations, and the result can be pasted back into the C++ source file as inline assembly or placed in a separate assembly source file.
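If you go the inline-assembly route, GNU extended asm keeps the kernel in the same translation unit as the surrounding C++. A minimal sketch of the clamp as a cmp/cmov pair, assuming GCC or Clang on x86-64 (the function name is mine, and a plain C++ fallback would be needed for other targets and compilers):

```cpp
#include <cassert>

// Clamp x to at most 255 using GNU extended inline asm (x86-64, GCC/Clang only).
int clamp255_asm(int x) {
    int limit = 255;
    __asm__("cmpl %[lim], %[x]\n\t"   // set flags from x - 255
            "cmovgl %[lim], %[x]"     // if (x > 255) x = 255
            : [x] "+r"(x)             // x is read and written, in a register
            : [lim] "r"(limit)        // limit is read-only, in a register
            : "cc");                  // we clobber the flags
    return x;
}
```

Note how much bookkeeping even this tiny example needs (constraints, clobbers) – that's exactly the maintenance cost discussed above, and why generating a well-optimised base from the compiler first is worthwhile.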
Upvotes: 2