Reputation: 22932
OK, I've been sitting in front of profiler results over the last three days, generated by running a pretty wide range of test cases through an automation suite. The idea is to see if there are any good optimisations that can generally improve performance. I'll qualify 'good' in this context as follows:
Has the potential for performance improvements that are both very significant and observable at the end-user level, e.g. > 100% improvement in an underperforming area.
Has the potential for significant memory usage reduction, e.g. > 50% reduction in a data-heavy area.
Is easy to implement, with minimal obfuscation to the code, and minimal side effects, i.e. the benefits of implementing the optimisation greatly outweigh the costs.
The application is a 3D mapping and modelling package with plenty of graphics in the interface and geometric processing at the back end. I've already done a lot on ensuring optimal algorithm selection for most processing, and at this stage I'm looking for any generally applicable, easy ways to get that extra bit of punch when dealing with large and complex data sets. So far, I've come up with the following:
When searching, keep a buffer of the most recently found items and check that first. A heck of a lot of processing with repetitive searches seems to search around the same area. From answers to date, this appears to be a specific form of memoization. (A sketch of the idea follows this list.)
When sorting, check the data isn't already in sort order (specifically where qsort has been used)
Keep the GUI and processing in separate threads (fails the 'good' criterion of being easy to implement, but IMO still worthwhile).
Where heavily used member functions have local variables with significant construction/destruction costs, make them private class members instead. Notably true of dynamic arrays and strings, especially MFC CArrays and CStrings.
When using dynamic arrays, set the initial size to slightly exceed typical usage, and use an exponential growth strategy. (A reserve-based sketch also follows the list.)
When dealing with very large data sets to be stored in arrays, size up the data first to avoid any reallocs.
Avoid function returns that create temporary object copies on the stack; use reference parameters instead, e.g.
CString MyFunc(double x, double y)
is less efficient than
void MyFunc(double x, double y, CString &Result)
Actually, avoid CStrings and most of MFC in any performance-critical area of the code. (Edit: this may be more generally negated by RVO, although not for CStrings in my app.)
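As mentioned in the first item, roughly what I mean by the recently-found buffer, as a minimal sketch — the container, buffer size and predicate here are all illustrative, not my actual code (assumes C++11):

#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Check a small buffer of recently found items before falling back to the
// full linear search. Caution: the cached pointers are invalidated if
// 'items' reallocates, so the buffer must be cleared on modification.
template <typename T, typename Pred>
const T* cached_find(const std::vector<T>& items,
                     std::deque<const T*>& recent,
                     Pred matches,
                     std::size_t max_recent = 8) {
    for (const T* p : recent)                  // 1. hot buffer first
        if (matches(*p)) return p;
    auto it = std::find_if(items.begin(), items.end(), matches);
    if (it == items.end()) return nullptr;     // 2. full search missed too
    recent.push_front(&*it);                   // 3. remember the hit
    if (recent.size() > max_recent) recent.pop_back();
    return &*it;
}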
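And for the dynamic-array items, the same idea expressed with std::vector (Point3D and the headroom factor are illustrative stand-ins); std::vector already grows geometrically, so only the initial reservation needs stating:

#include <cstddef>
#include <vector>

struct Point3D { double x, y, z; };  // stand-in for the real element type

// Reserve slightly above the typical count so the common path never
// reallocates; when the exact count is known up front, this removes
// all reallocations entirely.
std::vector<Point3D> build_points(std::size_t expected_count) {
    std::vector<Point3D> pts;
    pts.reserve(expected_count + expected_count / 8);  // small headroom
    for (std::size_t i = 0; i < expected_count; ++i)
        pts.push_back(Point3D{static_cast<double>(i), 0.0, 0.0});
    return pts;
}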
These are items that seem to work well in my context, but are there any obvious ones that I've left out, or are there any other good resources out there on optimisation?
Edit: Based on many of the comments provided, some further explanation is clearly required. While I fully realise that suggesting specific optimisations for a specific piece of code requires sight of that code, the past couple of days spent analysing profiler output have shown up certain patterns in terms of optimisation candidates. I'm also aware of my own ignorance in respect of what others are doing to good effect in the field, and see value (to me at least) in having an enumerated list of such techniques, regardless of whether they apply to my situation.
This question is not about the dangers of optimization, but for any beginners out there, I'd recommend you first establish a strong requirement for the need to optimize before attempting it at all. My own preference is to carry out most optimization at design stage based on performance requirements going forward, but I'm also a strong advocate of profiling to verify design assumptions have been met in implementation. I would ask people to please limit their answers to their own experience of positive optimizations rather than their beliefs as to whether we should consider optimization in the first instance.
FWIW, the difference between having the compiler optimise the code or not comes out at 12% across my automation suite, which is borderline observable at end user level.
Second edit: Some related posts I found very beneficial, particularly Mike Dunlavey's comments on becoming too dependent on profiler output.
Performance optimization strategies of last resort?
What can I use to profile C++ code in Linux?
Upvotes: 5
Views: 1197
Reputation: 490778
I'm surprised nobody seems to have responded to one point. In your question, you mention using qsort(). In quite a few cases, you can get quite a substantial speed improvement by using std::sort() instead of qsort(). The size of the improvement generally depends on how complex your comparison function is, but I've found a 2:1 improvement fairly typical, and 5:1 or even 10:1 is sometimes possible.
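By way of illustration (not from the original answer), the switch is mechanical, and it combines naturally with the questioner's "check it isn't already sorted" idea; std::is_sorted is C++11, so older toolchains would hand-roll that check:

#include <algorithm>
#include <vector>

void sort_values(std::vector<double>& v) {
    // Skip the sort entirely if the data is already ordered.
    if (std::is_sorted(v.begin(), v.end())) return;
    // std::sort can inline the comparison, unlike qsort's function
    // pointer - which is where most of the speedup comes from.
    std::sort(v.begin(), v.end());
}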
Upvotes: 3
Reputation: 40709
I sympathize with your position, especially
I've been sitting in front of profiler results over the last three days
I assume you've done a good job of coding the app intelligently. Now, here's what I suggest:
do not look for "any good optimisations that can generally improve performance" that you just try by guesswork. These will be obvious when you know what's taking time.
do have a better way of finding out what's taking time than staring at profiler output.
This is how I do it.
My experience says you have lots more room for speedup. This is my canonical example.
... and good luck. Let us know how it works out.
Upvotes: 2
Reputation: 6070
+1 Good question.
rant I think you were right to request people not to bleat about premature optimisation - all too often this old chestnut is wheeled out as an excuse for lazy design or sloppy implementation. There is such a thing as gratuitous under-optimisation, often caused by poor algorithm selection. end rant
I don't know where this 97% thing comes from. I was taught the 80/20 rule - 20% of the code runs 80% of the time, which interestingly also seems to apply to other things besides software. Anyway...
First port of call is always the algorithm - ensure you are using an efficient algorithm. An unoptimised good algorithm will almost always beat a highly optimised bad algorithm. I suspect you already know this.
A few general optimisations I've found can be useful:
Linear searches are often responsible for quite a lot of overhead. Often these can be replaced by binary searching a sorted collection.
Care is needed when sorting. Although quicksort is a great general purpose sorting algorithm, sometimes bubble sort or other kinds of sorting are faster when your set is already sorted or partially sorted.
Inserting into a sorted set - you could use a binary search to find the right place instead of the naive (although often good enough) implementation of "stick it on the end, then sort" (see the sketch after this list).
Separating out your keys from your values should help make searching keys faster by utilising the cache more efficiently.
Use a double buffer and swap when two threads interact as supplier/consumer, to minimise the time the buffer needs to be locked. It's often faster for threads to work on separate copies of data and then fix up later, rather than have them trip over each other's locks. (A sketch of the swap also follows the list.)
Don't rely on the compiler to optimise away constructors and destructors inside tight loops - either move objects outside the loop and reuse them, or confirm that the compiler has optimised it as you would wish.
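A sketch of the binary-search insertion from the third item (illustrative, using std::lower_bound):

#include <algorithm>
#include <vector>

// O(log n) to find the position, O(n) to shift elements - but no full
// re-sort afterwards, unlike "stick it on the end, then sort".
void sorted_insert(std::vector<int>& v, int value) {
    auto pos = std::lower_bound(v.begin(), v.end(), value);
    v.insert(pos, value);
}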
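And a rough sketch of the double-buffer swap, using C++11 threading primitives (which postdate this answer; all names are illustrative):

#include <mutex>
#include <utility>
#include <vector>

// Each side prepares or drains a full buffer outside the lock; the lock
// is held only for the O(1) swap of the vectors' internals.
class DoubleBuffer {
public:
    void publish(std::vector<int>& ready) {     // producer side
        std::lock_guard<std::mutex> lock(m_);
        std::swap(shared_, ready);
    }
    void take(std::vector<int>& out) {          // consumer side
        std::lock_guard<std::mutex> lock(m_);
        std::swap(out, shared_);
    }
private:
    std::mutex m_;
    std::vector<int> shared_;
};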
Upvotes: 1
Reputation: 248289
Start by optimizing your own time. Don't bother trying to list and/or apply optimizations blindly. Don't waste your time converting return values to reference parameters just because you don't trust the compiler to do NRVO.
Don't waste your time manually marking functions as inline. Don't waste your time trying to collect some kind of universal "Dictionary of Optimizations".
97% of your code just doesn't matter performance-wise. If you try to apply optimizations regardless of what they do and where they're useful, you're going to waste 97% of your time. All that time could have been spent optimizing the 3% of the code that actually mattered. (Incidentally, this is what Knuth actually meant with the "root of all evil" quote. Not that optimizations shouldn't be performed, but that unless you have a specific piece of code in mind which you already know is a hotspot, your optimizations will be 1) premature, and 2) ineffective)
So first optimization advice: Close this question and instead ask for help in optimizing the specific code that matters in your app. You're not going to learn anything useful about optimizing the 3% of your app that matter by asking for general optimization tricks.
Second optimization advice (assuming that you're looking at a specific hotspot right now and that you've done all you can in terms of choosing the right algorithm and parallelizing and other high-level optimizations): Look at the assembly output from your compiler.
Third optimization advice: Understand the system you're running on.
Structure your code to exploit spatial and temporal locality, to minimize cache misses. Simply switching traversal of a 2D array from column-major to row-major order can easily double performance.
Keep in mind that the compiler and CPU will both reorder instructions to improve throughput, but branches limit their ability to do this, so try to structure your code to get reasonably large basic blocks with no jumps into or out of them.
If you're running on a CPU that supports SIMD instructions, consider whether they can be used efficiently. If you have to really dive into instruction-level optimizations, make sure you have a grasp of the latencies of the instructions used.
For floating-point-heavy code, keep in mind that FP instructions in general will not be automatically reordered by the compiler or CPU. Because FP operations have fairly long latencies, dependency chains can be a real performance killer; breaking those up manually can speed your code up significantly.
Similarly, avoid memory dependencies: code that first writes, then reads, an address is going to be slow. If one or both memory accesses can't be optimized away, you have to wait for the write to complete (which could otherwise happen in the background without stalling the CPU) before starting the read. Place all frequently used data in temporaries, to avoid aliasing. (A sketch of the traversal-order and dependency-chain points follows.)
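The flat row-major array and the two-accumulator unroll below are illustrative assumptions, not from the answer:

// Row-major traversal touches memory sequentially, so every cache line
// fetched is fully used; swapping the loop order strides by 'cols'
// elements per access and thrashes the cache on large arrays.
double sum_row_major(const double* a, int rows, int cols) {
    double sum = 0.0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            sum += a[r * cols + c];   // contiguous accesses
    return sum;
}

// Two independent accumulators break the single FP dependency chain,
// letting the long-latency additions overlap in the pipeline.
double sum_two_chains(const double* a, int n) {
    double s0 = 0.0, s1 = 0.0;
    int i = 0;
    for (; i + 1 < n; i += 2) { s0 += a[i]; s1 += a[i + 1]; }
    if (i < n) s0 += a[i];   // odd-length tail
    return s0 + s1;
}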
As for optimizing "large complex datasets" as you asked? I have absolutely no clue. The thing about complex datasets is that they have very little in common. There is no general way to optimize them.
A final suggestion: It sounds like you're not really writing C++. You're talking about manually implementing dynamic arrays, you're talking about reallocs and MFC classes. It sounds like you're writing "C with classes". Use the standard library: std::vector already exists, and so does std::string. The C++ standard library has the nice property that it's extremely efficient. The same can't be said for MFC classes.
Upvotes: 3
Reputation: 111336
please don't throw up any of the old 'optimisation is the root of all evil' stuff, it's totally irrelevent to this question
Yes, and then you have things like:
When sorting, check the data isn't already in sort order
makes me wonder if you're using an efficient algorithm or not - which is the basic premise of what "premature optimization is the root of all evil" warns about.
and will be downvoted.
Really? That tone is no good. IMO. YMMV.
Similarly, I'm not interested in the joys of letting the compiler optimise it for me, or throwing inlines all over the place. FWIW, the difference between having the compiler optimise the code or not comes out at 12% across my automation suite, which is borderline observable at end user level.
Coupled with any optimization you hand-instrument, you'd still want the compiler optimizations on.
Apart from that, since you provide no particular insights on where your bottlenecks are, it is difficult if not impossible to provide any pointers. I'd hazard a guess that you at the least:
Edit: Since you say you were not aware of RVO: try reading up on move semantics, and particularly the move library from Adobe. I guess Boost will have something similar.
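For readers on a modern toolchain: C++11 later standardised the move semantics that the Adobe library emulated. A minimal sketch (the Mesh type and all names are illustrative):

#include <string>
#include <utility>
#include <vector>

struct Mesh {
    std::vector<double> vertices;
    std::string name;
};

// Returning by value is cheap when the payload can be moved: the vector's
// heap buffer is transferred to the caller, never deep-copied.
Mesh make_mesh() {
    Mesh m;
    m.vertices.resize(1000000);
    m.name = "terrain";
    return m;   // NRVO, or at worst a move
}

int main() {
    Mesh a = make_mesh();    // no copy of the million doubles
    Mesh b = std::move(a);   // explicit move; 'a' is left valid but unspecified
    (void)b;
    return 0;
}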
Edit#2: Two more things:
Upvotes: 9
Reputation: 78364
I fear large and complex data structures. I like large and simple data structures; in particular arrays. When I have large and simple data structures I can try to do clever things with memory access to really optimise my use of the cache; in particular memory-tiling. Not sure if this is any use to you, but in general, given your set of requirements and existing understanding of your code, I'd be looking for ways to optimise the getting of data from RAM to CPU.
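For example, a rough sketch of memory-tiling applied to a large out-of-place transpose (the block size, element type and row-major layout are assumptions on my part):

// Walk the matrix in BLOCK x BLOCK tiles so both the reads from 'src'
// and the strided writes to 'dst' stay within cache-resident regions.
const int BLOCK = 64;

void transpose_tiled(const float* src, float* dst, int n) {
    for (int ib = 0; ib < n; ib += BLOCK)
        for (int jb = 0; jb < n; jb += BLOCK)
            for (int i = ib; i < ib + BLOCK && i < n; ++i)
                for (int j = jb; j < jb + BLOCK && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
}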
And I would, of course, be parallelising the hell out of it all. Not possible unless you have a multicore computer. Oh, memo just in: we've all got those these days!!
Good luck and do let us know how you get on. I read a lot of crap on SO about what should be the best way to optimise this bit of code or that bit, but very little hard evidence that anyone ever measures anything, as you seem to have done.
Heck, I like your question so much I'm upvoting it.
Regards
Mark
Upvotes: 3
Reputation: 22493
CString MyFunc(double x, double y)
is less efficient than
void MyFunc(double x, double y, CString &Result)
If MyFunc is written cleanly, they should be about the same: the compiler should be able to take advantage of NRVO. It sounds like you've profiled and found that it's not - I'm just saying it may better fit your criteria, e.g. "minimal obfuscation to the code", to rearrange the function itself a little to allow NRVO to occur.
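For instance, a hypothetical NRVO-friendly shape (std::string stands in for CString here; std::to_string is C++11): a single named result object, returned on every path, lets the compiler construct it directly in the caller's storage.

#include <string>

// One named result, one return statement: the classic NRVO pattern.
std::string format_point(double x, double y) {
    std::string result;
    result += std::to_string(x);
    result += ",";
    result += std::to_string(y);
    return result;   // constructed in place in the caller if NRVO fires
}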
A few more things to try:
Consider wrapper types (e.g. Sorted<T>) that make such assumptions explicit. That way, if you have a method that takes a Sorted<vector<T> >, for example, and you give it a sorted vector, it passes straight through - but if you give it a vector<T>, it will have to construct a Sorted<vector<T> >, at which point it will sort it. You can manually assert that it is sorted using an alternative constructor, but it makes it much easier to carry your assumptions around and maybe catch places you might have otherwise missed.
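A rough sketch of such a wrapper, assuming C++11 (every detail here is illustrative):

#include <algorithm>
#include <utility>
#include <vector>

// Carries the "is sorted" assumption in the type system: constructing
// from an arbitrary container sorts; the tagged constructor asserts the
// caller already guarantees sortedness.
template <typename Container>
class Sorted {
public:
    explicit Sorted(Container c) : data_(std::move(c)) {
        std::sort(data_.begin(), data_.end());
    }
    struct AssumeSorted {};
    Sorted(Container c, AssumeSorted) : data_(std::move(c)) {}
    const Container& get() const { return data_; }
private:
    Container data_;
};

// A function taking Sorted<...> documents its precondition and can use
// binary search safely.
template <typename T>
bool contains(const Sorted<std::vector<T>>& s, const T& value) {
    return std::binary_search(s.get().begin(), s.get().end(), value);
}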
Upvotes: 5
Reputation: 783
I think your requirements are pretty much mutually exclusive unless there's an obvious flaw of some kind (which is all profiling is really good for finding).
Things that really change performance require a lot of effort, and your basic data structures are the most important thing: reducing memory fragmentation, aligned memory management, SIMD, data structures that are as small as possible and allocated all in one block as much as possible, multithreading, reducing code size from templates, and copying parameters into local variables so the optimizer knows they aren't aliased. None of those can be tacked on at the end without a lot of cost.
Many times you can't even easily measure the things that really affect performance because they only become costly as the program runs or as its code size grows.
Upvotes: 3
Reputation: 213210
If you run the code under a sampling profiler and find any compute-intensive routines taking a significant chunk of time, then there are numerous micro-optimisations that may be considered, up to and including the use of SIMD, if you are able to limit your support to specific CPU(s).
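By way of illustration (not from the answer), an SSE sketch; it is x86-specific, hence the caveat about limiting CPU support, and n is assumed to be a multiple of 4 to keep it short:

#include <xmmintrin.h>  // SSE intrinsics, x86 only

// Adds two float arrays four lanes at a time; a scalar tail loop would
// handle lengths that aren't multiples of 4.
void add_arrays_sse(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // unaligned 4-float load
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
}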
Upvotes: 0