Reputation: 826
I need to execute a function about 10^11 times. The function is self-contained and requires one integer as input, let's call it f(n). The range of n is in fact 0 < n < 10^11. We can ignore inclusion of endpoints, I just need the concept about running something of this magnitude in terms of indexes on CUDA.
I want to run this function using CUDA, but I have troubles conceptually. Namely, I know how to simulate my n, mentioned above, using the blocks and threads indexes. As shown in slide 40 of, nVidia Tutorial But, what happens when n>TotalNumberOfThreadsPer_CUDA_Call.
Essentially, does the thread count and block count reset for every call I make to run functions on CUDA? If so, is there a simple way to simulate n, as described earlier, for arbitrarily large n?
Thanks.
Upvotes: 1
Views: 331
Reputation: 2250
If you have to store the data instead of just computing it, you will need to do it in an iterative method. 10^11 values of any type are not going to fit in GPU memory.
I haven't compiled this code, but hopefully you'll get the gist.
__device__ double my_function(int value);
__global__ void my_kernel(int* data, size_t offset, size_t chunk_size) {
size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
size_t stride = gridDim.x*blockDim.x;
while(idx<chunk_size){
data[idx]=my_function(idx+offset);
idx+=stride;
}
}
void runKernel(size_t num_values){
size_t block_size = 128;
size_t grid_size = 1024;
size_t free_mem, total_mem;
cudaMemGetInfo(&free, &total);
size_t chunk_size = sizeof(double)/free_mem;
double *data;
cudaMalloc(&data, chunk_size);
for(size_t i=0; i<num_values; i+=chunk_size){
my_kernel<<<grid_size, block_size>>>(data, i, chunk_size);
//copy to host and process
//or call another kernel on device to process further
}
//process remainder of values that need to be run assuming num_values%chunk_size!=0
}
Upvotes: 0
Reputation: 37945
A common pattern when you want to process more elements than there are threads is to simply loop over your data in grid-sized chunks:
__global__ void kernel(int* data, size_t size) {
for (size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
idx < size;
idx += gridDim.x * blockDim.x) {
// do something with data[idx] ...
}
}
Another option is to launch several consecutive kernels with a start offset:
__global__ void kernel(int* data, size_t size, size_t offset) {
size_t idx = blockIdx.x * blockDim.x + threadIdx.x + offset;
if (idx < size) {
// do something with data[idx] ...
}
}
// Host code
dim3 gridSize = ...;
dim3 blockSize = ...;
for (size_t offset = 0; offset < totalWorkSize; offset += gridSize * blockSize) {
kernel<<<gridSize, blockSize>>>(data, totalWorkSize, offset);
}
In both cases, you can process an "arbitrarily large" number of elements. You're still limited by size_t
, so for 10^11 elements you will need to compile your code for 64 bits.
Upvotes: 4