Reputation: 261
I have a program that scales badly to multiple threads, although – theoretically – it should scale linearly: it's a calculation that splits into smaller chunks and doesn't need system calls, library calls, locking, etc. Running with four threads is only about twice as fast as running with a single thread (on a quad core system), while I'd expect a number closer to four times as fast.
The run time of the implementations with pthreads, C++0x threads and OpenMP agree.
In order to pinpoint the cause, I tried gprof (useless) and valgrind (I didn't see anything obvious). How can I effectively benchmark what's causing the slowdown? Any generic ideas as to its possible causes?
— Update —
The calculation involves Monte Carlo integration and I noticed that an unreasonable amount of time is spent generating random numbers. While I don't know yet why this happens with four threads, I noticed that the random number generator is not reentrant. When using mutexes, the running time explodes. I'll reimplement this part before checking for other problems.
I did reimplement the sampling classes which did improve performance substantially. The remaining problem was, in fact, contention of the CPU caches (it was revealed by cachegrind as Evgeny suspected.)
Upvotes: 7
Views: 336
Reputation: 24677
You can use oprofile. Or a poor man's pseudo-profiler: run the program under gdb, stop it and look where it is stopped. "valgrind --tool=cachegrind" will show you how efficiently CPU cache is used.
Monte Carlo integration seems to be very memory-intensive algorithm. Try to estimate, how memory bandwidth is used. It may be the limiting factor for your program's performance. Also if your system is only 2-core with hyperthreading, it should not work much faster with 4 threads, comparing with 2 threads.
Upvotes: 4