How to optimize large data manipulation in parallel

Question

I'm developing a C/C++ application to manipulate large quantities of data in a generic way (aggregation/selection/transformation). I'm using a AMD Phenom II X4 965 Black Edition, so with decent amount of different caches.

I've developed both ST and MT version of the functions to perform all the single operations and, not surprisingly, in the best case the MT version are 2x faster than the ST, even when using 4 cores.

Given I'm a fan of using 100% of available resources, I was pissed about the fact just 2x, I'd want 4x.
For this reason I've spent already quite a considerable amount of time with -pg and valgrind, using the cache simulator and callgraph. The program is working as expected and cores are sharing the input process data (i.e. operations to apply on data) and the cache misses are reported (as expected sic.) when the different threads load the data to be processed (millions of entities or rows if now you have an idea what I'm trying to do :-) ). Eventually I've used different compilers, g++ and clang++, with -O3 both, and performance is identical.

My conclusion is that due to the large amount of data (GB of data) to process, given the fact the data has got to be loaded eventually in the CPU, this is real wait time. Can I further improve my software? Have I hit a limit?

I'm using C/C++ on Linux x86-64, Ubuntu 11.10. I'm all ears! :-)

How to optimize large data manipulation in parallel

Answers (1)

Related Questions