jk4736
jk4736

Reputation: 261

Multi-threaded performance and profiling

I have a program that scales badly to multiple threads, although – theoretically – it should scale linearly: it's a calculation that splits into smaller chunks and doesn't need system calls, library calls, locking, etc. Running with four threads is only about twice as fast as running with a single thread (on a quad core system), while I'd expect a number closer to four times as fast.

The run time of the implementations with pthreads, C++0x threads and OpenMP agree.

In order to pinpoint the cause, I tried gprof (useless) and valgrind (I didn't see anything obvious). How can I effectively benchmark what's causing the slowdown? Any generic ideas as to its possible causes?

— Update —

The calculation involves Monte Carlo integration and I noticed that an unreasonable amount of time is spent generating random numbers. While I don't know yet why this happens with four threads, I noticed that the random number generator is not reentrant. When using mutexes, the running time explodes. I'll reimplement this part before checking for other problems.

I did reimplement the sampling classes which did improve performance substantially. The remaining problem was, in fact, contention of the CPU caches (it was revealed by cachegrind as Evgeny suspected.)

Upvotes: 7

Views: 336

Answers (1)

Evgeny Kluev
Evgeny Kluev

Reputation: 24677

You can use oprofile. Or a poor man's pseudo-profiler: run the program under gdb, stop it and look where it is stopped. "valgrind --tool=cachegrind" will show you how efficiently CPU cache is used.

Monte Carlo integration seems to be very memory-intensive algorithm. Try to estimate, how memory bandwidth is used. It may be the limiting factor for your program's performance. Also if your system is only 2-core with hyperthreading, it should not work much faster with 4 threads, comparing with 2 threads.

Upvotes: 4

Related Questions