Multi-threaded performance and profiling

Question

I have a program that scales badly to multiple threads, although – theoretically – it should scale linearly: it's a calculation that splits into smaller chunks and doesn't need system calls, library calls, locking, etc. Running with four threads is only about twice as fast as running with a single thread (on a quad core system), while I'd expect a number closer to four times as fast.

The run time of the implementations with pthreads, C++0x threads and OpenMP agree.

In order to pinpoint the cause, I tried gprof (useless) and valgrind (I didn't see anything obvious). How can I effectively benchmark what's causing the slowdown? Any generic ideas as to its possible causes?

— Update —

The calculation involves Monte Carlo integration and I noticed that an unreasonable amount of time is spent generating random numbers. While I don't know yet why this happens with four threads, I noticed that the random number generator is not reentrant. When using mutexes, the running time explodes. I'll reimplement this part before checking for other problems.

I did reimplement the sampling classes which did improve performance substantially. The remaining problem was, in fact, contention of the CPU caches (it was revealed by cachegrind as Evgeny suspected.)

Evgeny Kluev · Accepted Answer

You can use oprofile. Or a poor man's pseudo-profiler: run the program under gdb, stop it and look where it is stopped. "valgrind --tool=cachegrind" will show you how efficiently CPU cache is used.

Monte Carlo integration seems to be very memory-intensive algorithm. Try to estimate, how memory bandwidth is used. It may be the limiting factor for your program's performance. Also if your system is only 2-core with hyperthreading, it should not work much faster with 4 threads, comparing with 2 threads.

Multi-threaded performance and profiling

Answers (1)

Related Questions