Reputation: 11
Each thread in a block can produce a different number (and size) of results. At the moment I am allocating a fixed amount of device memory, per thread.
That means for XX threads I have to allocate XX * max_result_count * data_structure * sizeof(int) bytes (my data consists of integers). Each thread finds its memory block (offset) by computing int i = blockDim.x * blockIdx.x + threadIdx.x; and multiplying it by max_result_count * data_structure to index into the integer array.
In the real world this means a huge waste of device memory, because some result sets are close to empty and some are not. For example, I have to allocate just under 2 GB of device memory to be able to store the equivalent of 300 MB of results.
Any ideas on how to rework this?
For example, each thread could lock a mutex, increment the actual res_count, write its data into the shared memory block, and unlock the mutex.
[Problem solved, thanks, guys !]
Upvotes: 0
Views: 261
Reputation: 152173
You've already hinted in your question at one possible approach:
#define DSIZE (100*1048576)
__device__ unsigned int buffer_index = 0;
__device__ int *buffer_data;
In your host code:
int *buffer_data_temp;
cudaMalloc(&buffer_data_temp, sizeof(int)*DSIZE);
cudaMemcpyToSymbol(buffer_data, &buffer_data_temp, sizeof(int *));
In your thread code:
unsigned int my_buffer_offset = atomicAdd(&buffer_index, size_of_my_thread_data);
assert((my_buffer_offset+size_of_my_thread_data) < DSIZE);
memcpy(buffer_data+my_buffer_offset, my_thread_data, size_of_my_thread_data*sizeof(int));
(disclaimer: coded in browser, not tested)
It's not necessary to use a mutex, for example around the memcpy operation. Once we have reserved the starting and ending points of our allocation with the atomicAdd, the threads will not step on each other, even if all are writing data, because they are writing to separate regions within buffer_data.
EDIT: Here's a complete example:
#include <stdio.h>
#include <assert.h>
#define DSIZE (100*1048576)
#define nTPB 32
#define BLKS 2
__device__ unsigned int buffer_index = 0;
__global__ void update_buffer(int *buffer_data){
  const unsigned int size_of_my_thread_data = 1;
  unsigned int my_buffer_offset = atomicAdd(&buffer_index, size_of_my_thread_data);
  assert((my_buffer_offset+size_of_my_thread_data) < DSIZE);
  int my_thread_data[size_of_my_thread_data];
  my_thread_data[0] = (blockIdx.x*10000) + threadIdx.x;
  memcpy(buffer_data+my_buffer_offset, my_thread_data, size_of_my_thread_data*sizeof(int));
}

int main(){
  int *h_buffer_data, *d_buffer_data;
  cudaMalloc(&d_buffer_data, sizeof(int)*DSIZE);
  update_buffer<<<BLKS, nTPB>>>(d_buffer_data);
  unsigned int result_size;
  cudaMemcpyFromSymbol(&result_size, buffer_index, sizeof(unsigned int));
  h_buffer_data = (int *)malloc(sizeof(int)*result_size);
  cudaMemcpy(h_buffer_data, d_buffer_data, result_size*sizeof(int), cudaMemcpyDeviceToHost);
  for (unsigned int i = 0; i < result_size; i++)
    printf("%d\n", h_buffer_data[i]);
  return 0;
}
Upvotes: 1
Reputation: 10596
Rewrite the kernel and the calling function so that each launch processes only part of the required points (obviously, you'll have to change the number of blocks per launch, etc.):
int offset = 0;
for(int i = 0; i < numKernelLaunches; i++) {
  yourKernel<<<numBlocks,threadsPerBlock>>>(offset, /* your other parameters */);
  offset += numBlocks*threadsPerBlock;
  cudaDeviceSynchronize();
}
and in yourKernel you keep int i = blockDim.x * blockIdx.x + threadIdx.x; as the index for the global memory access and i + offset for the id of your data position.
Upvotes: 1