Emanuele

Reputation: 1456

How to optimize large data manipulation in parallel

I'm developing a C/C++ application to manipulate large quantities of data in a generic way (aggregation/selection/transformation). I'm running it on an AMD Phenom II X4 965 Black Edition, so it has a decent amount of cache at the different levels.

I've developed both single-threaded (ST) and multi-threaded (MT) versions of the functions that perform the individual operations and, not surprisingly, in the best case the MT version is only about 2x faster than the ST one, even when using 4 cores.

Given that I'm a fan of using 100% of the available resources, I was pissed that it's just 2x; I'd want 4x.
For this reason I've already spent a considerable amount of time with -pg and valgrind, using its cache simulator and call-graph tools (cachegrind and callgrind). The program works as expected: the cores share the input process data (i.e. the operations to apply to the data), and cache misses are reported (as expected, sic.) when the different threads load the data to be processed (millions of entities, or rows, so now you have an idea of what I'm trying to do :-) ). I've also tried different compilers, g++ and clang++, both with -O3, and the performance is identical.
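To give a rough idea, the MT version splits the rows into contiguous chunks, one per core, along these lines (a stripped-down sketch built with -std=c++0x -pthread; Row and process_row stand in for my real types and operations):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for the real entity/row type and per-row transformation.
struct Row { double fields[8]; };
void process_row(Row& r) { for (int i = 0; i < 8; ++i) r.fields[i] *= 2.0; }

// Each worker thread gets a contiguous slice of the rows and applies the
// requested operation to it; the main thread waits for all of them.
void process_parallel(std::vector<Row>& rows, unsigned nthreads)
{
    std::vector<std::thread> workers;
    const std::size_t chunk = rows.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = (t + 1 == nthreads) ? rows.size() : begin + chunk;
        workers.push_back(std::thread([&rows, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                process_row(rows[i]);
        }));
    }
    for (std::size_t t = 0; t < workers.size(); ++t)
        workers[t].join();
}
```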

My conclusion is that, because of the large amount of data (GBs) to process, and given that the data eventually has to be loaded into the CPU, this is genuine wait time. Can I further improve my software, or have I hit a limit?

I'm using C/C++ on Linux x86-64, Ubuntu 11.10. I'm all ears! :-)

Upvotes: 3

Views: 352

Answers (1)

What kind of application is it? Could you show us some code?

As I commented, you might have reached a hardware limit such as RAM bandwidth. If you have, no software trick will get around it.
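One way to check whether RAM bandwidth is the ceiling is a crude STREAM-like test: copy buffers much larger than the cache with 1 thread and then with 4, and compare the aggregate throughput. If it barely scales, the memory bus is the bottleneck rather than your code. A rough sketch (not a calibrated benchmark; compile with -O3 -pthread):

```cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>

int main()
{
    const std::size_t bytes    = 256u << 20;   // 256 MiB per thread, well beyond L3
    const unsigned    nthreads = 4;            // try 1, then 4

    // Allocate and touch the buffers up front so only the copies are timed.
    std::vector<std::vector<char> > src(nthreads, std::vector<char>(bytes, 1));
    std::vector<std::vector<char> > dst(nthreads, std::vector<char>(bytes, 0));

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] { std::memcpy(dst[t].data(), src[t].data(), bytes); });
    for (auto& w : workers) w.join();
    double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();

    // Each byte is counted once even though memcpy both reads and writes it.
    std::printf("%.2f GB/s aggregate\n", nthreads * bytes / secs / 1e9);
}
```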

You might investigate using MPI, OpenMP, or OpenCL (on GPUs), but without an idea of what your application does we cannot help much.
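For instance, OpenMP lets you express a parallel aggregation over your rows in a couple of lines (a minimal sketch; sum_field is a hypothetical operation, compile with g++ -O3 -fopenmp):

```cpp
#include <cstdio>
#include <vector>

// The OpenMP runtime splits the loop across cores and combines the partial sums.
double sum_field(const std::vector<double>& values)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)values.size(); ++i)
        total += values[i];
    return total;
}

int main()
{
    std::vector<double> v(10 * 1000 * 1000, 1.0);
    std::printf("%f\n", sum_field(v));
}
```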

If you are compiling with GCC and want to help the processor's cache prefetching, consider using __builtin_prefetch, with care and parsimony (using it too much or badly will decrease performance).
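A rough illustration of what that looks like in a linear scan (the prefetch distance here is arbitrary and has to be tuned for your access pattern and hardware):

```cpp
#include <cstddef>

// Ask the hardware to fetch a cache line a fixed number of elements ahead of
// the one being processed. Too short a distance does nothing; too long, or
// prefetching on every iteration of a tight loop, can hurt performance.
double sum_with_prefetch(const double* data, std::size_t n)
{
    const std::size_t ahead = 64;   // tuning parameter, not a magic value
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + ahead < n)
            __builtin_prefetch(&data[i + ahead], 0 /* read */, 1 /* low temporal locality */);
        total += data[i];
    }
    return total;
}
```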

Upvotes: 1
