K rakesh
K rakesh

Reputation: 17

CUDA kernel for Finite Element Assembly

We have an unstructured tetrahedral mesh file containing following format:

element-ID  nod1 nod2 nod3 nod4


    1            452  3434  322 9000

    2            2322   837 6673 2323

    .
    .
    .

300000

We partitioned the above mesh for partition size of 2048 each. For each partition size of 2048 contains unique nod1 nod2 nod3 nod4 values, we pass 1 block and 512 threads at different start index.

In a cuda file, we have

__global__ void calc(double d_ax,int *nod1,int *node2,int *nod3,int *nod4,int   start,int size)
{
    int n1,n2,n3,n4;     
    int i = blockIdx.x * blockDim.x + threadIdx.x + start;


    if ( i < size )
    {

        n1=nod1[i];
        n2=nod2[i];
        n3=nod3[i];
        n4=nod4[i];

        ax[n1] += some code;
        ax[n2] += some code;
        ax[n3] += some code;
        ax[n4] += some code;
    }
}

We call the kernel as

calc<<<1,512>>>(d_ax,....,0,512);
calc<<<1,512>>>(d_ax,....,512,512);
calc<<<1,512>>>(d_ax,....,1024,512); 
calc<<<1,512>>>(d_ax,....1536,512);

the above code works well but the problem is we get different results using more than one block at a time. For example:

calc<<<2,512>>>(d_ax,....,0,1024); 
calc<<<2,512>>>(d_ax,....,1024,1024); 

Can anyone help me?

Upvotes: 0

Views: 303

Answers (2)

CygnusX1
CygnusX1

Reputation: 21778

For each partition size of 2048 contains unique nod1 nod2 nod3 nod4 values

but in two partition sets the same node index can appear? If in two different blocks you have

Block 1: ax[1234]=do something

Block 2: ax[1234]=do something else

it smells as a race condition. You never know which one of the two blocks will be faster to write....

Upvotes: 0

talonmies
talonmies

Reputation: 72348

I am not sure how you expect anyone to tell you what might be wrong when the code you have posted is incomplete and uncompilable, but if in your single block case you really are calling the kernel as you have posted, this is what should happen:

calc<<<1,512>>>(d_ax,....,0,512);    // process first 512 elements
calc<<<1,512>>>(d_ax,....,512,512);  // start >= 512, size == 512, does nothing
calc<<<1,512>>>(d_ax,....,1024,512); // start >= 1024, size == 512, does nothing
calc<<<1,512>>>(d_ax,....1536,512);  // start >= 1536, size == 512, does nothing

So irrespective of whether your code might be broken when run using multiple blocks, your results for the single block case are probably wrong, and the whole point of your question is probably irrelevant as a result.

If you want a better answer, edit your question so it contains a complete description of the problem and concise, complete code that could actually be compiled. Otherwise this is about as much as anybody could guess from the information you have provided.

Upvotes: 1

Related Questions