Gaussian Elimination Parallelism

Question

I have sucessfully implemented a single threaded program in CUDA for Gaussian elimination and would like to achieve parallelism. Up to this point the parallel code looks like:

__global__ void ParallelGaussian(double* A)
{
    int index = threadIdx.x;
    int stride = blockDim.x;

    if (index < ROWS) //Skip additional threads
    {
        for (unsigned int r = index; r < ROWS; r += stride)
        {
            //Forward elimination to reduce to row echelon form
            for (unsigned int k = r + 1; k < ROWS; ++k)
            {
                double c = -A[(ROWS + 1) * k + r] / A[(ROWS + 1) * r + r];
                for (unsigned int j = r; j < ROWS + 1; ++j)
                {
                    if (r == j)
                        A[(ROWS + 1) * k + j] = 0.0;
                    else
                        A[(ROWS + 1) * k + j] += c * A[(ROWS + 1) * r + j];
                }
            }
        }
    }
}

As we can see the code on the GPU will transform the 1D-array (matrix) to a lower triangular matrix and then on the CPU I will continue with back substitution to get the final result. There is no pivoting done in this approach as it is not entirely needed but indeed improves the numerical stability of the algorithm.

Launching the kernel with a single thread and a block works and transforms the matrix into row echelon form:

ParallelGaussian << < 1, 1 >> >(dev_a);

However, if I would like to increase the number of threads, like

ParallelGaussian << < 1, 32 >> >(dev_a);

it will fail to produce the lower triangular matrix. Now adding __syncthreads() calls into the code in order to synchronize the threads in a block doesn't improve the situation what so ever and I can't figure out why.

Gaussian Elimination Parallelism

Answers (1)

Related Questions