Reputation: 119
I have sucessfully implemented a single threaded program in CUDA for Gaussian elimination and would like to achieve parallelism. Up to this point the parallel code looks like:
__global__ void ParallelGaussian(double* A)
{
int index = threadIdx.x;
int stride = blockDim.x;
if (index < ROWS) //Skip additional threads
{
for (unsigned int r = index; r < ROWS; r += stride)
{
//Forward elimination to reduce to row echelon form
for (unsigned int k = r + 1; k < ROWS; ++k)
{
double c = -A[(ROWS + 1) * k + r] / A[(ROWS + 1) * r + r];
for (unsigned int j = r; j < ROWS + 1; ++j)
{
if (r == j)
A[(ROWS + 1) * k + j] = 0.0;
else
A[(ROWS + 1) * k + j] += c * A[(ROWS + 1) * r + j];
}
}
}
}
}
As we can see the code on the GPU will transform the 1D-array (matrix) to a lower triangular matrix and then on the CPU I will continue with back substitution to get the final result. There is no pivoting done in this approach as it is not entirely needed but indeed improves the numerical stability of the algorithm.
Launching the kernel with a single thread and a block works and transforms the matrix into row echelon form:
ParallelGaussian << < 1, 1 >> >(dev_a);
However, if I would like to increase the number of threads, like
ParallelGaussian << < 1, 32 >> >(dev_a);
it will fail to produce the lower triangular matrix. Now adding __syncthreads() calls into the code in order to synchronize the threads in a block doesn't improve the situation what so ever and I can't figure out why.
Upvotes: 0
Views: 747
Reputation: 32732
Consider your inner loop. Every thread accesses A
, and since k
and j
run from r
to the end of the matrix, there is the potential for multiple threads to modify the same A[(ROWS + 1) * k + j]
value.
You also potentially have some threads accessing A[(ROWS + 1) * r + j]
while other threads are updating that value.
One possible solution is to have each thread accumulate into individual result arrays, then combine those at the end. This is memory intensive.
Another would be to restructure this so that only one thread will write to a particular value, and storing those values in a new matrix (so that you don't change any value that might be needed by a different thread).
Upvotes: 1