Tudor

Reputation: 62459

Parallel application has random behavior

I am writing a C program using pthreads to do a wavefront pattern computation on a two-dimensional matrix. To achieve good performance, I distribute several rows to each thread in an interleaved manner, like so:

thread 0 ------------------

thread 1 ------------------

thread 2 ------------------

thread 3 ------------------

thread 0 ------------------

thread 1 ------------------

thread 2 ------------------

thread 3 ------------------

etc.

In this computation I think this is the only viable split, since each thread needs the new values computed on each row and cannot move further until they are available. Now, the catch here is that thread 1 needs the values computed by thread 0 on its row, so it has to trail behind thread 0 and not go ahead of it. For this purpose, I am splitting each row into chunks and protecting each chunk with a critical section. The computation goes like:

thread 0 -----------

thread 1 ------

thread 2 ---

such that row i always has to trail behind row i - 1. I hope you understand the idea. I have implemented this idea and I'm testing it on a dual quad-core machine. I am experiencing strange behavior. The results are calculated correctly, but the running time varies anywhere from 8x faster than the sequential time to slower than the sequential time. Concretely, the sequential time for a 12000 x 12000 matrix is 16 seconds, and the parallel running time is anywhere between 2 and 17 seconds, and often differs between two consecutive runs.

My initial idea was that this problem is very locality sensitive, so of course I could be getting bad performance if let's say thread 0 and thread 1 are scheduled on different physical processors. Looking around in /proc/cpuinfo, I have deduced that the cores are mapped such that 0, 2, 4, 6 are on processor 0 and 1, 3, 5, 7 are on processor 1. Then, upon thread creation, I have used pthread_setaffinity_np to set the affinity for the threads on the right cores. However, nothing changes. I have also tried using pthread_attr_setaffinity_np and also sched_setaffinity but I get the same random running time.

Either the kernel is ignoring my affinity calls or this is not the problem. I really hope someone can help me as I have run out of ideas. Thank you.

Upvotes: 4

Views: 344

Answers (1)

DiVan

Reputation: 381

The kernel is not ignoring your affinity calls; it's just not the issue.

It looks like cache misses, i.e. threads contending for the same cache lines. If your threads did not share data, you could leave a big gap in memory between rows to avoid this.

Since they do share data, try making your chunks the size of a single cache line (64 bytes): have thread 0 go through 64 bytes, and only after that let thread 1 read from the same place, and so forth.

Upvotes: 7

Related Questions