Profiling a multiprocess system

Question

I have a system that I need to profile.

It consists of tens of processes, mostly C++, some with several threads, that communicate with the network and with one another through various system calls.

I know there are performance bottlenecks sometimes, but no one has put in the time/effort to check where they are: they may be in userspace code, inefficient use of syscalls, or something else.

What would be the best way to approach profiling a system like this? I have thought of the following strategy:

Manually logging the roundtrip times of various code sequences (for example processing an incoming packet or a cli command) and seeing which process takes the largest time. After that, profiling that process, fixing the problem and repeating.

This method seems sort of hacky and guess-worky. I don't like it.

How would you suggest approaching this problem? Are there tools that would help me out (a multi-process profiler?)?

What I'm looking for is more of a strategy than just specific tools.

Should I profile every process separately and look for problems? If so, how do I approach this?

Do I try to isolate the problematic processes and go from there? If so, how do I isolate them?

Are there other options?

Mats Petersson · Accepted Answer

I don't think there is a single answer to this sort of question. Every type of issue has its own problems and solutions.

Generally, the first step is to figure out WHERE in the big system the time is spent. Is it CPU-bound or I/O-bound?
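One rough way to answer the CPU-bound vs. I/O-bound question for a single code path is to compare the wall-clock time with the CPU time it consumed. The sketch below is only an illustration: `TimeSplit`, `measure`, `burn_cpu` and `wait_idle` are hypothetical names, and it assumes a POSIX-like system where `std::clock()` reports processor time for the process.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

// Wall-clock and CPU time consumed by one call, in seconds.
struct TimeSplit {
    double wall_seconds;
    double cpu_seconds;
};

// Runs `work` once and samples both clocks around it.
// Assumption: std::clock() returns processor time used by the process, which
// holds on POSIX systems. It sums over all threads, so in a multithreaded
// process cpu_seconds can exceed wall_seconds.
template <typename Work>
TimeSplit measure(Work&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(wall_end - wall_start).count(),
            static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC};
}

// Hypothetical CPU-bound workload: spins until the deadline passes.
inline void burn_cpu(std::chrono::milliseconds duration) {
    const auto deadline = std::chrono::steady_clock::now() + duration;
    volatile unsigned long counter = 0;
    while (std::chrono::steady_clock::now() < deadline) {
        counter = counter + 1;
    }
}

// Hypothetical stand-in for blocking I/O: the thread sleeps and uses no CPU.
inline void wait_idle(std::chrono::milliseconds duration) {
    std::this_thread::sleep_for(duration);
}
```

If `cpu_seconds` is close to `wall_seconds`, the path is probably CPU-bound; if it is much smaller, the code spent most of its time waiting (I/O, locks, sleeping) and a CPU profiler alone will not show the cause.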

If the problem is CPU-bound, a system-wide profiling tool can be useful to determine where in the system the time is spent. The next question is of course whether that time is actually necessary. No automated tool can tell the difference between a badly written piece of code that does a million completely useless processing steps and one that does a matrix multiplication with a million elements very efficiently: both take the same amount of CPU time, but one isn't actually achieving anything. However, knowing which program takes most of the time in a multi-program system can be a good starting point for figuring out IF that code is well written or can be improved.

If the system is I/O-bound, for example on network or disk I/O, then there are tools for analysing disk and network traffic that can help. But again, the tool cannot tell you what packet response time or disk access time you should expect: whether you contact Google to search for "kerflerp" or your local webserver a meter away has a dramatic impact on what counts as a reasonable response time.

There are lots of other issues. Running two pieces of code in parallel that use LOTS of memory can make both run slower than if they were run in sequence, for example because the high memory usage causes swapping, or because the OS isn't able to use spare memory for caching file I/O.

On the other hand, two or more simple processes that use very little memory will benefit quite a lot from running in parallel on a multiprocessor system.

Adding logging to your applications so that you can see WHERE they are spending time is another method that works reasonably well, particularly if you KNOW which use-case is the slow one.

If you have a use-case where you know "this should take no more than X seconds", running regular pre- or post-commit tests to check that the code is behaving as expected, and that no one has added a lot of code that slows it down, would also be a useful thing.
