vegeta

Reputation: 105

CUDA different threads per block for different functions

I am making a CUDA program and am stuck on a problem. I have two functions:

  1. __global__ void cal_freq_pl(float *, char *, char *, int *, int *)
  2. __global__ void cal_sum_vfreq_pl(float *, float *, char *, char *, int *)

I call the first function like this: cal_freq_pl<<<M,512>>>( ... ); M is about 15, so I'm not worried about it. 512 is the maximum number of threads per block on my GPU. This works fine and gives the expected output for all M*512 values.

But when I call the 2nd function in the same way: cal_sum_vfreq_pl<<<M,512>>>( ... ); it does not work. After debugging the crap out of that function, I finally found out that it only runs with these dimensions: cal_sum_vfreq_pl<<<M,384>>>( ... );, which is 128 less than 512. With 512 it shows no error, but the results are incorrect.
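
For reference, this is roughly how I launch and check the kernel (just a sketch; launch_and_check and the device pointer names are placeholders for my real setup code):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void cal_sum_vfreq_pl(float *, float *, char *, char *, int *);

// Placeholder helper: the device pointers and M come from my real setup code.
void launch_and_check(float *freq_d, float *vfreq_d, char *w_d, char *v_d,
                      int *k_d, int M)
{
    cal_sum_vfreq_pl<<<M, 512>>>(freq_d, vfreq_d, w_d, v_d, k_d);

    // A block that asks for too many resources shows up here as a launch error.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // Errors that happen while the kernel runs only surface after a sync.
    err = cudaThreadSynchronize();   // cudaDeviceSynchronize() on newer toolkits
    if (err != cudaSuccess)
        printf("kernel run failed: %s\n", cudaGetErrorString(err));
}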

I currently only have access to compute capability 1.0 hardware: an Nvidia Quadro FX 4600 graphics card in a 64-bit Windows machine.

I have no idea why such behavior should happen. I am positive that the 1st function runs with 512 threads per block, while the 2nd only runs correctly with 384 (or fewer).

Can someone please suggest some possible solution?

Thanks in advance...

EDIT: Here is the kernel code:

__global__ void cal_sum_vfreq_pl(float *freq, float *v_freq_vectors, char *wstrings, char *vstrings, int *k){
    int index = threadIdx.x;
    int m = blockIdx.x;
    int block_dim = blockDim.x;
    int kv = *k; int vv = kv-1; int wv = kv-2;
    int woffset = index*wv;
    int no_vstrings = pow_pl(4, vv);
    float temppp = 0;
    char wI[20], Iw[20]; int Iwi, wIi;
    for(int i=0;i<wv;i++) Iw[i+1] = wI[i] = wstrings[woffset + i];
    for(int l=0;l<4;l++){
        Iw[0] = get_nucleotide_pl(l);
        wI[vv-1] = get_nucleotide_pl(l);
        Iwi = binary_search_pl(vstrings, Iw, vv);
        wIi = binary_search_pl(vstrings, wI, vv);
        temppp = temppp + v_freq_vectors[m*no_vstrings + Iwi] + v_freq_vectors[m*no_vstrings + wIi];
    }
    freq[index + m*block_dim] = 0.5*temppp;
}

Upvotes: 0

Views: 596

Answers (1)

kangshiyin

Reputation: 9779

It seems your second kernel uses a lot of registers. You cannot always reach the maximum threads per block because of per-block hardware resource limits, such as the number of registers available per block.

CUDA provides a tool to help calculate the proper number of threads per block:

http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

You can also find this .xls file in your CUDA installation directory.
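
You can also query the per-kernel limit at run time with cudaFuncGetAttributes. Here is a minimal sketch (print_kernel_limits is just an illustrative helper; it assumes it is compiled in the same .cu file that defines your kernel):

#include <cstdio>
#include <cuda_runtime.h>

// The kernel must be defined in the same translation unit for the runtime
// to report its attributes.
__global__ void cal_sum_vfreq_pl(float *, float *, char *, char *, int *);

void print_kernel_limits()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, cal_sum_vfreq_pl);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return;
    }
    // Registers used per thread, and the largest block the runtime will
    // actually accept for this particular kernel on this device.
    printf("registers per thread:           %d\n", attr.numRegs);
    printf("max threads per block (kernel): %d\n", attr.maxThreadsPerBlock);
}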

Upvotes: 1
