Dipan Mehta

Reputation: 2160

Analyzing and profiling a multi-threaded application

We have a multithreaded application that does heavy packet processing across multiple pipeline stages. The application is written in C and runs under Linux.

The entire application works fine and has no memory leaks or thread safety issues. However, in order to analyse the application, how can we profile and analyse the threads?

In particular, here is what we are interested in:

  1. the resource usage of each thread
  2. the frequency and duration of lock contention between threads
  3. the overhead introduced by synchronization
  4. any bottlenecks in the system
  5. the best system throughput we can achieve

What are the best techniques and tools available for this?

Upvotes: 7

Views: 5880

Answers (4)

Brian Swift

Reputation: 1443

Take a look at Intel VTune Amplifier XE (formerly … Intel Thread Profiler) to see if it will meet your needs. This and other Intel Linux development tools are available free for non-commercial use.

In the video Using the Timeline in Intel VTune Amplifier XE a timeline of a multi-threaded application is demonstrated. The presenter uses a graphic display to show lock activity and how to dig down to the source line of the particular lock causing serialization. At 9:20 the presenter mentions "with the frame API you can programmatically mark certain events or phases in your code. And these marks will appear on the timeline."
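For reference, the frame API the presenter mentions comes from the ITT instrumentation library (ittnotify.h) that ships with VTune. Below is a minimal sketch of how a pipeline stage might mark each packet as a frame; the domain name and the placement of the calls are my own assumptions, and the include path and link flag may vary with your VTune installation:

    /* Minimal sketch of the ITT frame API (ittnotify.h ships with VTune).
     * The domain name and call placement are illustrative assumptions;
     * compile against the VTune include directory and link with -littnotify. */
    #include <ittnotify.h>

    static __itt_domain *pipeline_domain;

    void pipeline_init(void)
    {
        /* A domain groups all marks belonging to this subsystem. */
        pipeline_domain = __itt_domain_create("PacketPipeline");
    }

    void process_one_packet(void)
    {
        /* Everything between begin and end shows up as one frame on the
         * VTune timeline, so per-packet phases become visible. */
        __itt_frame_begin_v3(pipeline_domain, NULL);
        /* ... packet processing for this pipeline stage ... */
        __itt_frame_end_v3(pipeline_domain, NULL);
    }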

Upvotes: 3

ehuffman

Reputation: 141

Do you have the flexibility to develop under Darwin (OS X) and deploy on Linux? The performance analysis tools there are excellent and easy to use (Shark and Thread Viewer are useful for your purpose).

There are many Linux performance tools, of course. gprof, Valgrind (with Cachegrind, Callgrind, Massif), and VTune will do what you need.

To my knowledge, there is no tool that will directly answer your questions. However, the answers may be found by cross-referencing the data points and metrics from both instrumentation-based and sampling-based solutions.

Upvotes: 0

Mike Dunlavey

Reputation: 40659

I worked on a similar system some years ago. Here's how I did it:

Step 1. Get rid of unnecessary time-takers in individual threads. For that I used this technique. This is important to do because the overall messaging system is limited by the speed of its parts.

Step 2. This part is hard work but it pays off. For each thread, print a time-stamped log showing when each message was sent, received, and acted upon. Then merge the logs into a common timeline and study it. What you are looking for is a) unnecessary retransmissions, for example due to timeouts, b) extra delay between the time a message is received and when it is acted upon. This can happen, for example, if a thread has multiple messages in its input queue, some of which can be processed more quickly than others. It makes sense to process those first.
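A minimal sketch of the kind of time-stamped logging this step needs; the event names ("RECV", "PROC", "SEND") and the log format are my own assumptions, not taken from the original system:

    /* Minimal sketch of per-thread, time-stamped event logging.
     * Event names and log format are illustrative assumptions. */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    static FILE *event_log;   /* a real system would keep one file per thread */
    static pthread_mutex_t event_log_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Record one event for a message with a monotonic timestamp so the
     * per-thread logs can later be merged and sorted into one timeline. */
    static void log_event(const char *stage, const char *event, long msg_id)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);

        pthread_mutex_lock(&event_log_lock);
        fprintf(event_log, "%ld.%09ld %s %s %ld\n",
                (long)ts.tv_sec, ts.tv_nsec, stage, event, msg_id);
        pthread_mutex_unlock(&event_log_lock);
    }

    int main(void)
    {
        event_log = fopen("pipeline_events.log", "w");
        if (!event_log)
            return 1;

        /* Example usage from a pipeline stage: */
        log_event("stage1", "RECV", 42);
        /* ... process the packet ... */
        log_event("stage1", "PROC", 42);
        log_event("stage1", "SEND", 42);

        fclose(event_log);
        return 0;
    }

With one such file per thread, the merged timeline is just a numeric sort on the timestamp column; the gap between RECV and PROC for the same message id is exactly the kind of delay described above.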

You may need to alternate between these two.

Don't expect this to be easy. Some programmers are too fine to be bothered with this kind of dirty work. But, you could be pleasantly surprised at how fast you can make the whole thing go.

Upvotes: 1

Martin James

Reputation: 24847

1) Don't know. There are some profilers available for Linux.

2) If you are pipelining, each stage should be doing sufficient work to ensure that contention on the producer-consumer queues is minimal. You can dig this out with some timings - if a stage takes 10 ms+ to process a packet, you can forget about contention/lock issues. If it takes 100 us, you should consider amalgamating a couple of stages so that each stage does more work.

3) Same as (2), unless there is a separate synchronization issue with some global data or whatever.

4) Dumping/logging the queue counts every second would be useful. The longest queue will be in front of the stage with the narrowest neck (see the sketch after this list).

5) No idea - don't know how your current system works, what hardware it's running on, etc. There are some 'normal' optimizations - eliminating memory-manager calls with object pools, adding extra threads to the stages with the heaviest CPU loads, things like that - but 'what is the best system throughput we can get' is too ethereal to say.
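A minimal sketch of the once-per-second queue-depth logging suggested in (4); the number of stages and the counter layout are my own assumptions:

    /* Minimal sketch of a per-second queue-depth monitor. The stage count
     * and counter layout are illustrative assumptions; each pipeline thread
     * would update its own counter as packets are enqueued/dequeued. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NUM_STAGES 4

    static volatile long queue_depth[NUM_STAGES];

    /* Print every queue depth once per second; the consistently longest
     * queue sits in front of the bottleneck stage. Reads are unsynchronized
     * and therefore approximate, which is fine for diagnostics. */
    static void *queue_monitor(void *arg)
    {
        (void)arg;
        for (;;) {
            sleep(1);
            for (int i = 0; i < NUM_STAGES; i++)
                printf("stage %d input queue depth: %ld\n", i, queue_depth[i]);
            fflush(stdout);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, queue_monitor, NULL);

        /* ... start the pipeline stage threads here ... */

        pthread_join(tid, NULL);
        return 0;
    }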

Upvotes: 0
