superscalar

Reputation: 725

multi-thread matrix multiplication

I'm writing code that does N x N matrix multiplication using thread-level parallelism.

To get C = A X B, I first transposed matrix B and divided the matrices into blocks.

A thread takes a block from A and a block from B, multiplies them,
and then adds the result to the corresponding block in C.
All matrices are allocated in heap memory using malloc().

But the problem is that for the same input,
the answer is sometimes incorrect and sometimes correct.

I'm not quite sure why this happens,
but I guess the code needs to be improved in terms of thread safety.

I've posted part of my code below.
blocks is the number of blocks per row and per column, i.e. N / block_size,
so the total number of threads is blocks^3.

while (thread_loaded < total_thread)
 {
  if (thread_count < MAX_THREAD)
  {
   p[thread_loaded].idx = thread_idx;
   p[thread_loaded].matinfo = &mi;   
   threads[thread_loaded] = CreateThread(NULL, 0, matmul_tlp, &p[thread_loaded], 0, &threadid[thread_loaded]);
   if (threads[thread_loaded] != NULL)
   {     
    thread_loaded++;
    thread_count++;
    thread_idx += blocks;
    if (thread_idx >= total_thread)
     thread_idx = (thread_idx % total_thread) + 1;
   }          
  }
 }

For the thread function:

int i, j, k;  
param* p = (param*)arg; 

int blocks = BLOCKS;
int block_size = BLOCK_SIZE;
int Ar = p->idx / (blocks * blocks); 
int Ac = p->idx % blocks;
int Br = (p->idx / blocks) % blocks; 
int Bc = p->idx % blocks;
int Cr = p->idx / (blocks * blocks);
int Cc = (p->idx % (blocks * blocks)) / blocks;

double** A = p->matinfo->A;
double** B = p->matinfo->B;
double** C = p->matinfo->C;


DWORD res = WaitForSingleObject(mutex[Cr * blocks + Cc], INFINITE);
if (res != WAIT_OBJECT_0)
 perror("cannot acquire mutex.");


for (i = 0; i < block_size; i++)
{
 for (j = 0; j < block_size; j++)
 {
  for (k = 0; k < block_size; k++)
  {
   C[Cr * block_size + i][Cc * block_size + j] +=
    A[Ar * block_size + i][Ac * block_size + k] *
    B[Br * block_size + j][Bc * block_size + k];
  }
 }
}


ReleaseMutex(mutex[Cr * blocks + Cc]);


thread_count--;
return NULL;

It would be appreciated if anyone can find a hole. :)

Upvotes: 2

Views: 4177

Answers (2)

Claudio Corsi

Reputation: 359

Note that matrix multiplication lends itself to using multiple threads in a non-blocking manner. The thing to remember is that c[i,j] = sum over k of a[i,k] * b[k,j]. Each such computation is self-contained and requires no synchronization within your thread code.

Each thread would just need a reference to a class or struct that contains the a, b, c matrices, an entries container, and the mutex used to synchronize that container. Accessing the entries container requires the mutex, but everything else requires no mutex calls at all. Each entry in the container holds the row and column of the c matrix that will be calculated next. Each thread then queries the entries container for work until the container is empty. The only real task is generating the entries container prior to creating/starting the threads.

Upvotes: 0

Ze Blob

Reputation: 2957

There are two shared values being written to in your threaded function: the C matrix and the thread_count variable. I don't see anything wrong with how the C matrix is synchronised, but it's worth double-checking that the Cc and Cr values are computed correctly, since they are what your mutex choice is based on.

The most likely source of error is the thread_count variable, which is not synchronized. Remember that a-- is equivalent to a = a - 1 and consists of one read followed by one write operation (it's NOT atomic). This means it's very likely to be preempted between the read and the write, and thus some decrements of the variable can be lost. For example:

Thread A reads the value 10 from thread_count.
Thread B reads the value 10 from thread_count.
Thread B writes the value 10-1 into thread_count.
Thread A writes the value 10-1 into thread_count.
thread_count is now equal to 9 when it should be 8.

So you've just lost one of your signals to the thread-creation loop. This could cause your algorithm to run with fewer and fewer threads as you lose decrements. Sadly, unless I missed something, I don't see it explaining why you get bad results; your program will just run slower than it should.

The easiest way to fix this would be to use a condition variable, which is made specifically for these kinds of situations and avoids the lost-wakeup problem that you have here.

By the way, creating threads is usually a pretty costly operation, so a thread pool would be preferable to creating and destroying tons of threads. If a thread pool is more complex than necessary, you could use condition variables to make your threads wait for a new task once they're done with their current one.

EDIT: I wrote the solution a little too quickly and I ended up confusing two different issues.

First, you want to synchronize your accesses to the thread_count variable in both the thread-creation function and the calculation function. To do that you can either use a mutex or use atomic operations like InterlockedDecrement and InterlockedIncrement (I hope these are the correct Windows equivalents). Note that, while I don't see any issues with using atomic operations in this case, they are not as simple to use as they might seem. Make sure you fully understand them before using them.

The second issue is that you're spinning in your thread-creation function while waiting for one of the threads to finish. You can avoid that by using a condition variable to signal the main thread once a calculation is completed. That way you don't end up taking processor time for no reason.

One more thing: make sure your thread_count variable is declared volatile to indicate to the compiler that the order in which it's read and written within the function matters. Otherwise, the compiler is free to re-order the operations to increase performance.

Upvotes: 1
