Reputation: 9989
In certain areas of development, such as game development, real-time systems, etc., it is important to have a fast and optimized program. On the other hand, modern compilers already do a lot of optimization, and optimizing in Assembly can be time-consuming in a world where deadlines are a factor to take into consideration.
Is optimizing certain functions with Assembly in a C/C++ program really worth it?
Is there really a sufficient gain in performance when optimizing a C/C++ program with Assembly with today's modern compilers?
What I understand from the answers posted is that any gain that can be made matters in certain areas, such as embedded systems and multimedia programming (graphics, sound, etc.). Also, one needs to be capable (or have someone capable) of doing a better job in Assembly than a modern compiler. Writing really well-optimized C/C++ can take less time and do a good enough job. One last thing: learning Assembly can help you understand the inner mechanics of a program and make you a better programmer in the end.
Upvotes: 13
Views: 2690
Reputation: 40659
Good answers. I would say "Yes" IF you have already done performance tuning like this, and you are now in the position of
KNOWING (not guessing) that some particular hot-spot is taking more than 30% of your time,
seeing just what assembly language the compiler generated for it, after all attempts to make it generate optimal code,
knowing how to improve on that assembler code, and
being willing to give up some portability.
Compilers do not know everything you know, so they are defensive and cannot take advantage of what you know.
As one example, they write subroutine entry and exit code in a general way that works no matter what the subroutine contains. You, on the other hand, may be able to hand-code little routines that dispense with frame pointers, saving registers, and stuff like that. You're risking bugs, but it is possible to beat the compiler.
Upvotes: 1
Reputation: 20608
Definitely yes!
Here is a demonstration of a CRC-32 calculation which I wrote in C++, then optimized in x86 assembler using Visual Studio.
InitCRC32Table() should be called at program start. CalcCRC32() will calculate the CRC for a given memory block. Both functions are implemented in both assembler and C++.
On a typical Pentium machine, you will notice that the assembler CalcCRC32() function is 50% faster than the C++ code.
The assembler implementation is not MMX or SSE, but simple x86 code. The compiler will never produce code as efficient as manually crafted assembler code.
DWORD* panCRC32Table = NULL;  // CRC-32 CCITT 0x04C11DB7

void DoneCRCTables()
{
    if (panCRC32Table)
    {
        delete[] panCRC32Table;
        panCRC32Table = NULL;
    }
}

void InitCRC32Table()
{
    if (panCRC32Table) return;
    panCRC32Table = new DWORD[256];
    atexit(DoneCRCTables);

    /* C++ reference implementation:
    for (int bx=0; bx<256; bx++)
    {
        DWORD eax = bx;
        for (int cx=8; cx>0; cx--)
            if (eax & 1)
                eax = (eax>>1) ^ 0xEDB88320;
            else
                eax = (eax>>1);
        panCRC32Table[bx] = eax;
    }
    */
        _asm cld
        _asm mov edi, panCRC32Table
        _asm xor ebx, ebx
    p0: _asm mov eax, ebx
        _asm mov ecx, 8
    p1: _asm shr eax, 1
        _asm jnc p2
        _asm xor eax, 0xEDB88320  // bit-swapped 0x04C11DB7
    p2: _asm loop p1
        _asm stosd
        _asm inc bl
        _asm jnz p0
}
/* C++ reference implementation:
DWORD inline CalcCRC32(UINT nLen, const BYTE* cBuf, DWORD nInitVal= 0)
{
    DWORD crc= ~nInitVal;
    for (DWORD n=0; n<nLen; n++)
        crc= (crc>>8) ^ panCRC32Table[(crc & 0xFF) ^ cBuf[n]];
    return ~crc;
}
*/

DWORD inline __declspec(naked) __fastcall CalcCRC32(UINT nLen,
                                                    const BYTE* cBuf,
                                                    DWORD nInitVal= 0)  // used to calc CRC of chained bufs
{
        _asm mov eax, [esp+4]        // param3: nInitVal
        _asm jecxz p2                // __fastcall param1 ecx: nLen
        _asm not eax
        _asm push esi
        _asm push ebp
        _asm mov esi, edx            // __fastcall param2 edx: cBuf
        _asm xor edx, edx
        _asm mov ebp, panCRC32Table
        _asm cld
    p1: _asm mov dl, al
        _asm shr eax, 8
        _asm xor dl, [esi]
        _asm xor eax, [ebp+edx*4]
        _asm inc esi
        _asm loop p1
        _asm pop ebp
        _asm pop esi
        _asm not eax
    p2: _asm ret 4                   // eax: returned value; ret 4 pops the one stack param (nInitVal)
}
// test code:
#include "mmSystem.h"  // timeGetTime
#pragma comment(lib, "Winmm.lib")

InitCRC32Table();
BYTE* x= new BYTE[1000000];
for (int i= 0; i<1000000; i++) x[i]= 0;

DWORD d1= ::timeGetTime();
for (int i= 0; i<1000; i++)  // declare a fresh loop variable; the first i is scoped to its own loop
    CalcCRC32(1000000, x, 0);
DWORD d2= ::timeGetTime();

TRACE("%d\n", d2-d1);
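For environments without MSVC inline assembly, the commented-out C++ path above can be gathered into a portable, self-contained sketch (fixed-width types stand in for the Windows DWORD/BYTE typedefs; the snake_case names are my own). With this polynomial, initial value, and final XOR it produces the standard CRC-32 check value 0xCBF43926 for the ASCII string "123456789".

```cpp
#include <cstdint>
#include <cstddef>

// Portable table-driven CRC-32 (reflected polynomial 0xEDB88320),
// equivalent to the commented-out C++ reference code above.
static uint32_t crc32_table[256];

void init_crc32_table()
{
    for (uint32_t i = 0; i < 256; ++i)
    {
        uint32_t crc = i;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        crc32_table[i] = crc;
    }
}

// init_val allows chaining CRCs over multiple buffers, as in the original.
uint32_t calc_crc32(const uint8_t* buf, size_t len, uint32_t init_val = 0)
{
    uint32_t crc = ~init_val;
    for (size_t i = 0; i < len; ++i)
        crc = (crc >> 8) ^ crc32_table[(crc & 0xFF) ^ buf[i]];
    return ~crc;
}
```

This version is what you would benchmark the assembler routine against on a modern compiler before deciding the hand-written code is still worth keeping.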
Upvotes: 2
Reputation: 170489
Don't forget that by rewriting in assembly you lose portability. Today you don't care, but tomorrow your customers might want your software on another platform, and then those assembly snippets will really hurt.
Upvotes: 1
Reputation: 6692
I'd say it's not worth it. I work on software that does real-time 3D rendering (i.e., rendering without assistance from a GPU). I make extensive use of SSE compiler intrinsics -- lots of ugly code filled with _mm_add_ps() and friends -- but I haven't needed to recode a function in assembly in a very long time.
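For readers who haven't seen intrinsics, a minimal sketch of this style (the function name is mine; it assumes 16-byte-aligned pointers and a length that is a multiple of four, with no tail handling):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Add two float arrays four elements at a time. The compiler still
// does register allocation and scheduling; you only pick the
// instructions, so there is no hand-written assembly to maintain.
void add_arrays_sse(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);             // load 4 floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));  // 4 adds in one instruction
    }
}
```
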
My experience is that good modern optimizing compilers are pretty darn effective at intricate, micro-level optimizations. They'll do sophisticated loop transformations such as reordering, unrolling, pipelining, blocking, tiling, jamming, fission, and the like. They'll schedule instructions to keep the pipeline filled, vectorize simple loops, and deploy some interesting bit twiddling hacks. Modern compilers are incredibly fascinating beasts.
Can you beat them? Well, sure; given that they choose the optimizations to use by heuristics, they're bound to get it wrong sometimes. But I've found it's much better to optimize the code itself by looking at the bigger picture. Am I laying out my data structures in the most cache-friendly way? Am I doing something unorthodox that misleads the compiler? Can I rewrite something a bit to give the compiler better hints? Am I better off recomputing something instead of storing it? Could inserting a prefetch help? Have I got false cache sharing somewhere? Are there small code optimizations that the compiler considers unsafe but that are okay here (e.g., converting division to multiplication by the reciprocal)?
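The division-to-reciprocal example might look like this (a sketch with a made-up function name; the point is that the compiler usually won't do it for you under strict floating-point rules, because the result can differ in the last bit):

```cpp
// Hoist one division out of the loop as a reciprocal multiply.
// Same math up to rounding, but one divide instead of n, and the
// multiply in the loop is far cheaper than a divide would be.
float sum_scaled(const float* v, int n, float divisor)
{
    float inv = 1.0f / divisor;  // single division, done once
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += v[i] * inv;
    return sum;
}
```

You have to decide yourself that the tiny rounding difference is acceptable; that judgment is exactly what the compiler can't make for you.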
I like to work with the compiler instead of against it. Let it take care of the micro-level optimizations, so that you can focus on the mezzo-level optimizations. The important thing is to have a good idea how your compiler works so that you know where the boundaries between the two levels are.
Upvotes: 28
Reputation: 4435
I would say that for most people and most applications, it's not worth it. Compilers are very good at optimising precisely for the architecture they're compiling for.
That's not to say that optimising in assembly is never warranted. A lot of math-heavy and low-level intensive code is often optimised by using specific CPU instructions, such as SSE, to improve on the compiler's generated instruction and register use. In the end, the human knows precisely the point of the program; the compiler can only assume so much.
I would say that if you're not at the level where you know your own assembly will be faster, then I would let the compiler do the hard work.
Upvotes: 1
Reputation: 106127
For your typical small shop developer writing an App, the performance gain/effort trade-off almost never justifies writing assembly. Even in situations where assembly can double the speed of some bottleneck, the effort is often not justifiable. In a larger company, it might be justifiable if you're the "performance guy".
However, for a library writer, even small improvements for large effort are often justified, because it saves time for thousands of developers and users who use the library in the end. Even more so for compiler writers. If you can get a 10% efficiency win in a core system library function, that can literally save millennia (or more) of battery life spread across your user base.
Upvotes: 4
Reputation: 29862
I'll assume you've profiled your code, and you've found a small loop which is taking up most of the time.
First, try recompiling with more aggressive compiler optimizations, and then re-profile. If you're running with all compiler optimizations turned on and you still need more performance, then I recommend looking at the generated assembly.
What I typically do after looking at the assembly code for the function is see how I can change the C code to get the compiler to write better assembly. The advantage of doing it this way is that I end up with code which is tuned to run with my compiler on my processor, but is still portable to other environments.
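A common instance of this approach (illustrative functions, not from the answer) is working around pointer aliasing. In the first version below the compiler must assume the output pointer may alias the inputs, so it stores to memory every iteration; the second version accumulates in a local, which the compiler can keep in a register and often vectorize, yet the source stays portable C++:

```cpp
// Aliasing forces a store (and often a reload) on every iteration.
void dot_before(const float* a, const float* b, int n, float* sum)
{
    *sum = 0.0f;
    for (int i = 0; i < n; ++i)
        *sum += a[i] * b[i];
}

// A local accumulator cannot alias a or b, so it lives in a register;
// the generated assembly improves without any inline asm.
void dot_after(const float* a, const float* b, int n, float* sum)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += a[i] * b[i];
    *sum = acc;
}
```

Both compute the same dot product; comparing their generated assembly is exactly the kind of inspection the answer describes.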
Upvotes: 4
Reputation: 3156
There is one area where assembly optimisation is still regularly performed - embedded software. These processors are usually not very powerful, and have many architectural quirks that may not be exploited by the compiler for optimisation. That said, it should still only be done for particularly tight areas of code and it has to be very well documented.
Upvotes: 4
Reputation: 146053
Before you know whether assembly will help, you need a profile, which you get with a profiling tool. Some programs spend all their time waiting for a database, or they simply don't have their runtime concentrated in a small area. Without that, assembly doesn't help much.
There is a rule of thumb that 90% of the runtime happens in 10% of the code. You really want one very intense bottleneck, and not every program has that.
Also, the machines are so fast now that some of the low-hanging fruit has been eaten, so to speak, by the compilers and CPU cores. For example, say you write way better code than the compiler and cut the instruction count in half. Even then if you end up doing the same number of memory references, and if they are the bottleneck, you may not win.
Of course, you could start preloading registers in previous loop iterations, but the compiler is likely to already be trying that.
Learning assembly is really more important as a way to comprehend what the machine really is, rather than as a way to beat the compiler. But give it a try!
Upvotes: 5
Reputation: 625037
The only possible answer to that is: yes, if there is a performance gain that is relevant and useful.
The question should I guess really be: Can you get a meaningful performance gain by using assembly language in a C/C++ program?
The answer is yes.
The cases where you get a meaningful increase in performance have probably diminished over the last 10-20 years as libraries and compilers have improved, but on an architecture like x86 in particular, hand optimization can still pay off in certain applications (particularly graphics-related ones).
But, like anything, don't optimize until you need to.
I would argue that algorithm optimization and writing highly efficient C (in particular) will create far more of a performance gain for less time spent than rewriting in assembly language in the vast majority of cases.
Upvotes: 11
Reputation: 41858
The difficulty is: can you do a better job of optimizing than the compiler can, given the architecture of modern CPUs? If you are designing for a simple CPU (such as in embedded systems), then you may do reasonable optimizations, but for a pipelined architecture the optimization is much harder, as you need to understand how the pipelining works.
So, given that, if you can do this optimization, and you are working on something that the profiler tells you is too slow, and it is a part that should be as fast as possible, then yes, optimizing makes sense.
Upvotes: 6