nowox

Reputation: 29178

Real-time programming C performance dilemma

I am working on an embedded architecture where ASM is predominant. I would like to refactor most of our legacy ASM code into C in order to increase readability and modularity.

So I am still puzzling over minor details which cause my hopes to vanish. The real problem is far more complex than the following example, but I would like to share it as an entry point to the discussion.

My goal is to find an optimal workaround.

Here is the original example (do not worry about what the code does; I wrote it randomly just to show the issue I would like to talk about).

int foo;
int bar;
int tmp;
int sum;

void do_something() {
    tmp = bar;
    bar = foo + bar;
    foo = foo + tmp;
}

void compute_sum() {
    for(tmp = 1; tmp < 3; tmp++)
        sum += foo * sum + bar * sum;
}

void a_function() {
    compute_sum();
    do_something();
}

With this dummy code, anyone would immediately remove all the global variables and replace them with local ones:

void do_something(int *a, int *b) {
    int tmp = *b;
    *b = *a + *b;
    *a = *a + tmp;
}

void compute_sum(int *sum, int *foo, int *bar) {
    int tmp;
    for(tmp = 1; tmp < 3; tmp++)
        *sum += *foo * *sum + *bar * *sum;
}

void a_function(int *sum, int *foo, int *bar) {
    compute_sum(sum, foo, bar);
    do_something(foo, bar);
}

Unfortunately, this rework is worse than the original code because all the parameters are pushed onto the stack, which leads to latency and larger code size.

The everything-global solution is both the best and the ugliest solution, especially when the source code is about 300k lines long with almost 3000 global variables.

Here we are not facing a compiler problem, but a structural issue. Writing beautiful, portable, readable, modular and robust code will never pass the ultimate performance test, because compilers are dumb, even in 2015.

An alternative solution is to prefer inline functions. Unfortunately, these functions have to be located in a header file, which is also ugly.
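
For illustration, here is a minimal sketch of that approach (the header and function names are hypothetical): the helper lives in a header as a static inline function, so every translation unit that includes it can inline the call without any cross-file optimization.

/* fast_math.h: hypothetical header-only helper. */
#ifndef FAST_MATH_H
#define FAST_MATH_H

/* The compiler sees the body in every translation unit that includes this
   header, so it can inline the call and avoid pushing arguments. */
static inline int weighted_sum(int foo, int bar, int sum) {
    return foo * sum + bar * sum;
}

#endif /* FAST_MATH_H */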

A compiler cannot see beyond the file it is working on. When a function is marked as extern, it will irrevocably lead to performance issues, because the compiler cannot make any assumptions about the external declarations.

On the other hand, the linker could do the job and ask the compiler to rebuild object files by giving it additional information. Unfortunately, not many compilers offer such a feature, and when they do, they considerably slow down the build process.

I eventually came across this dilemma:

  1. Keep the code ugly to preserve performance

    • Everything's global
    • Functions without parameters (same as procedures)
    • Keeping everything in the same file
  2. Follow standards and write clean code

    • Think of modules
    • Write small but numerous functions with well defined parameters
    • Write small but numerous source files

What should I do when the target architecture has limited resources? Going back to assembly is my last option.

Additional Information

I am working on a SHARC architecture, which is a quite powerful Harvard CISC architecture. Unfortunately, one code instruction takes 48 bits while a long only takes 32 bits. Given this, it is better to keep two versions of a variable rather than evaluating the second value on the fly:

The optimized example:

int foo;
int bar;
int half_foo;

void example_a() {
   write(foo); 
   write(half_foo + bar);
}

The bad one:

void example_a(int foo, int bar) {
   write(foo); 
   write(bar + (foo >> 1));
}

Upvotes: 3

Views: 217

Answers (3)

user3629249

Reputation: 16550

I have designed/written/tested/documented many, many real-time embedded systems.

Both 'soft' real time and 'hard' real time.

I can tell you from hard earned experience that the algorithm used to implement the application is the place to make the biggest gains in speed.

Little stuff, like a function call compared to inlined code, is trivial unless performed thousands (or even hundreds of thousands) of times.

Upvotes: 0

user4842163

Reputation:

I'm used to working in performance-critical core/kernel-type areas with very tight needs, where it's often beneficial to take the optimizer and standard library performance with a grain of salt (ex: not getting too excited about the speed of malloc or auto-generated vectorization).

However, I've never had such tight needs so as to make the number of instructions or the speed of pushing more arguments to the stack be a considerable concern. If it is, indeed, a major concern for the target system and performance tests are failing, one thing to note is that performance tests modeled at a micro level of granularity often do have you obsessed with smallest of micro-efficiencies.

Micro-Efficiency Performance Tests

We made the mistake of writing all kinds of superficial micro-level tests in a former workplace I was at where we made tests to simply time something as basic as reading one 32-bit float from a file. Meanwhile, we made optimizations that significantly sped up the broad, real-world test cases associated with reading and parsing the contents of entire files while, at the same time, some of those uber-micro tests actually got slower for some unbeknownst reason (they weren't even directly modified, but changes to the code around them may have had some indirect impact relating to dynamic factors like caches, paging, etc., or merely how the optimizer treated such code).

So the micro-level world can get a bit more chaotic when you work with a higher-level language than assembly. The performance of the teeny things can shift under your feet a bit, but you have to ask yourself what's more important: a slight decrease in the performance of reading one 32-bit float from a file, or having real-world operations that read from entire files go significantly faster. Modeling your performance tests and profiling sessions at a higher level will give you room to selectively and productively optimize the parts that really matter. There you have many ways to skin a cat.

Run a profiler on an ultra-granular operation being executed a million times repeatedly and you would have already backed yourself into an assembly-type micro-corner for everything performing such micro-level tests just by the nature of how you are profiling the code. So you really want to zoom out a bit there, test things at a coarser level so that you can act like a disciplined sniper and hone in on the micro-efficiency of very select parts, dispatching the leaders behind inefficiencies rather than trying to be a hero taking out every little insignificant foot soldier that might be a performance obstacle.

Optimizing Linker

One of your misconceptions is that only the compiler can act as an optimizer. Linkers can perform a variety of optimizations when linking object files together, including inlining code. So there should rarely, if ever, be a need to jam everything into a single object file as an optimization. I'd try looking more into the settings of your linker if you find otherwise.
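
As a rough illustration of the idea (the file and function names here are made up), with link-time optimization enabled (for example GCC's -flto, or the VisualDSP++ -ipa switch described in another answer) the call below can be inlined even though the caller and callee live in separate source files:

/* scale.c: one translation unit. */
int scale(int x) {
    return x * 3;
}

/* main.c: another translation unit. With link-time optimization the
   call to scale() can be inlined across the object-file boundary, so
   splitting code into separate files need not cost anything at run time. */
int scale(int x);   /* would normally come from a header */

int main(void) {
    return scale(14);
}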

Interface Design

With these things aside, the key to a maintainable, large-scale codebase lies more in interface (i.e., header files) than implementation (source files). If you have a car with an engine that goes a thousand miles per hour, you might peer under the hood and find that there are little fire-breathing demons dancing around to allow that to happen. Perhaps there was a pact involved with demons to get such speed. But you don't have to expose that fact to the people driving the car. You can still give them a nice set of intuitive, safe controls to drive that beast.

So you might have a system that makes uninlined function calls 'expensive', but expensive relative to what? If you are calling a function that sorts a million elements, the relative cost of pushing a few small arguments to the stack like pointers and integers should be absolutely trivial no matter what kind of hardware you're dealing with. Inside the function, you might do all sorts of profiler-assisted things to boost performance like macros to forcefully inline code no matter what, perhaps even some inlined assembly, but the key to keeping that code from cascading its complexity throughout your system is to keep all that demon code hidden away from the people who are using your sort function and to make sure it's well-tested so that people don't have to constantly pop the hood trying to figure out the source of a malfunction.
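
As a small sketch of that separation (everything here is hypothetical): the header exposes only a declaration, while the implementation file is free to use forced inlining or other profiler-justified tricks without leaking them to callers.

/* sort.h: the public interface, all that callers ever see. */
void sort_values(int *values, int count);

/* sort.c: the implementation can use whatever tricks the profiler
   justifies (forced inlining, unrolling, even inline assembly) without
   any of it showing up in the interface. */

/* Could map to a compiler-specific always-inline attribute instead. */
#define FORCE_INLINE static inline

FORCE_INLINE void swap_values(int *a, int *b) {
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

void sort_values(int *values, int count) {
    int i, j;
    /* A simple insertion sort; the point is the hidden helper, not the algorithm. */
    for (i = 1; i < count; i++)
        for (j = i; j > 0 && values[j - 1] > values[j]; j--)
            swap_values(&values[j - 1], &values[j]);
}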

Ignoring that 'relative to what?' question and only focusing on absolutes is also what leads to the micro-profiling which can be more counter-productive than helpful.

So I'd suggest looking at this more from a public interface design level, because behind an interface, if you look behind the curtains/under the hood, you might find all kinds of evil things going on to get that needed edge in performance in hotspot areas shown in a profiler. But you shouldn't need to pop the hood very often if your interfaces are well-designed and well-tested.

Globals become a bigger problem the wider their scope. If you have globals defined statically with internal linkage inside a source file that no one else can access, then those are actually rather 'local' globals. If thread-safety isn't a concern (if it is, then you should avoid mutable globals as much as possible), then you might have a number of performance-critical areas in your codebase where, if you peer under the hood, you find a lot of file-scope static variables used to mitigate the overhead of function calls. That's still a whole lot easier to maintain than assembly, especially when the visibility of such globals is reduced with smaller and smaller source files dedicated to performing more singular, clear responsibilities.
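
For example, the question's globals could be narrowed to internal linkage along these lines (a sketch; the module and function names are made up). The fast, parameter-free style is preserved, but the state is invisible outside the file:

/* mixer.c: hypothetical module. The state is still 'global' for speed,
   but static, so only this file can touch it. */
static int foo;
static int bar;
static int sum;

/* Parameter-free internal helpers, in the same style as the original code. */
static void do_something(void) {
    int tmp = bar;
    bar = foo + bar;
    foo = foo + tmp;
}

static void compute_sum(void) {
    int tmp;
    for (tmp = 1; tmp < 3; tmp++)
        sum += foo * sum + bar * sum;
}

/* The only symbols with external linkage: a narrow public interface. */
void mixer_step(void) {
    compute_sum();
    do_something();
}

int mixer_result(void) {
    return sum;
}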

Upvotes: 4

Ben Voigt

Reputation: 283971

Ugly C code is still a lot more readable than assembler. In addition, it's likely that you'll net some unexpected free optimizations.

A compiler cannot see beyond the file it is working on. When a function is marked as extern, it will irrevocably lead to performance issues, because the compiler cannot make any assumptions about the external declarations.

False and false. Have you tried "Whole Program Optimization" yet? It gives you the benefits of inline functions without having to organize them into headers. Not that putting things in headers is necessarily ugly, if you organize the headers.

In your VisualDSP++ compiler, this is enabled by the -ipa switch.

The ccts compiler has a capability called interprocedural analysis (IPA), a mechanism that allows the compiler to optimize across translation units instead of within just one translation unit. This capability effectively allows the compiler to see all of the source files that are used in a final link at compilation time and make use of that information when optimizing.

All of the -ipa optimizations are invoked after the initial link, whereupon a special program called the prelinker reinvokes the compiler to perform the new optimizations.

Upvotes: 7
