Reputation: 1155
I have a tight loop exactly like the one Chandler Carruth presented at CppCon 2017: https://www.youtube.com/watch?v=2EWejmkKlxs. At about 25 minutes into the video, there is a loop like this:
for (int& i : v)
    i = i > 255 ? 255 : i;
where v is a vector. This is exactly the code used in my program, which profiling shows takes a significant amount of time.
In his presentation, Chandler modified the assembly and sped up the loop. My question is: in practice, in production code, what is the recommended approach to optimising this? Should we use inline assembly in the C++ code? Or, like Chandler did, compile the C++ code to assembly and then optimise the assembly by hand?
An example of how to optimise the above for loop would be really appreciated, assuming an x86 architecture.
Upvotes: 1
Views: 324
Reputation: 364059
Chandler modified the compiler's asm output because that's an easy way to do a one-off experiment and find out whether a change would be useful, without doing all the work you'd normally need to maintain an asm loop or function as part of a project's source code.
Compiler-generated asm is usually a good starting point for an optimized loop, but keeping the whole file as-is isn't a good (or even viable) way to maintain an asm implementation of a loop as part of a program. See @Aconcagua's answer.
Plus it defeats the purpose of having any other functions in the file written in C++ and being available for link-time optimization.
Re: actually clamping:
Note that Chandler was just experimenting with changes to the non-vectorized code-gen, and had disabled unrolling + auto-vectorization. In real life, hopefully you can target SSE4.1 or AVX2 and let the compiler auto-vectorize with pminsd or pminud for signed or unsigned int clamping to an upper bound. (These are also available in other element sizes. Or without SSE4.1, with just SSE2, maybe you can do 2x packssdw => packuswb (unsigned saturation), then unpack with zeros back up to 4 vectors of dword elements. (If you can't just use an output of uint8_t[]!)
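To illustrate, here's roughly what the auto-vectorized inner loop does, written out with SSE4.1 intrinsics (_mm_min_epi32 is the intrinsic for pminsd). This is just a sketch of what the compiler would generate for you; the function name and the GCC/Clang target attribute are my own additions:

```cpp
#include <immintrin.h>   // SSE intrinsics
#include <cassert>
#include <cstddef>
#include <vector>

// Clamp each element of v to at most 255, 4 ints per vector iteration.
// _mm_min_epi32 compiles to pminsd (signed 32-bit packed min).
__attribute__((target("sse4.1")))   // GCC/Clang: allow SSE4.1 in this function
void clamp255_sse41(std::vector<int>& v) {
    const __m128i limit = _mm_set1_epi32(255);
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&v[i]));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&v[i]),
                         _mm_min_epi32(x, limit));
    }
    for (; i < v.size(); ++i)        // scalar tail for leftover elements
        v[i] = v[i] > 255 ? 255 : v[i];
}
```

The point is that this is what you get for free from auto-vectorization of the plain C++ loop, without any hand-written asm to maintain.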
And BTW, in the comments of the video, Chandler said it turns out that he made a mistake, and the effect he was seeing wasn't really due to a predictable branch vs. a cmov. It might have been a code-alignment thing, because changing from mov %ebx, (%rdi) to movl $255, (%rdi) made a difference!
(AMD CPUs aren't known to have register-read stalls the way P6-family did, so they should have no trouble hiding the dependency chain of a cmov coupling a store to a load, vs. breaking it with branch prediction + speculation past the branch.)
You would very rarely actually want to use a hand-written asm loop. Often you can hand-hold and/or trick your compiler into making asm more like what you want just by modifying the C++ source. Then a future compiler is free to tune differently for -march=some_future_cpu.
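For example, expressing the clamp with std::min keeps the source fully portable and gives current compilers everything they need to auto-vectorize at -O3 (the function name here is my own; it's just the question's loop restated):

```cpp
#include <algorithm>   // std::min
#include <cassert>
#include <vector>

// Same clamp as the original loop, written so the compiler is free to
// auto-vectorize it (e.g. with pminsd) and to re-tune for future -march targets.
void clamp255(std::vector<int>& v) {
    for (int& i : v)
        i = std::min(i, 255);
}
```

You can check the generated asm on a compiler explorer to confirm the vectorization actually happened for your target.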
Upvotes: 2
Reputation: 25516
My question is: in practice, in production code, what is the recommended approach to optimising this? Should we use inline assembly in the C++ code? Or, like Chandler did, compile the C++ code to assembly and then optimise the assembly by hand?
For production code you need to consider that the software might be compiled and linked in an automated build system.
How would you apply changes to assembler code in such a system? You might apply a diff file, but that might break if optimisation (or other) settings are changed, if you switch to another compiler, or ...
That leaves two options: write the entire function in an assembler file (.s), or use inline assembler inside the C++ code – the latter possibly with the advantage of keeping related code in the same translation unit.
Still, I'd let the compiler generate the assembler code once, with the highest optimisation level available. That code can then serve as an (already pre-optimised) base for your hand-made optimisations, and the result can be pasted back into the C++ source file as inline assembly or placed in a separate assembly source file.
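If you go the inline-assembly route, GNU extended asm keeps the kernel in the same translation unit as the surrounding C++. A minimal sketch of the clamp as a cmp/cmov pair, assuming GCC or Clang on x86-64 (the function name is mine, and a plain C++ fallback would be needed for other targets and compilers):

```cpp
#include <cassert>

// Clamp x to at most 255 using GNU extended inline asm (x86-64, GCC/Clang only).
int clamp255_asm(int x) {
    int limit = 255;
    __asm__("cmpl %[lim], %[x]\n\t"   // set flags from x - 255
            "cmovgl %[lim], %[x]"     // if (x > 255) x = 255
            : [x] "+r"(x)             // x is read and written, in a register
            : [lim] "r"(limit)        // limit is read-only, in a register
            : "cc");                  // we clobber the flags
    return x;
}
```

Note how much bookkeeping even this tiny example needs (constraints, clobbers) – that's exactly the maintenance cost discussed above, and why generating a well-optimised base from the compiler first is worthwhile.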
Upvotes: 2