user3314134

Reputation: 11

CUDA: sharing device memory between threads

Each thread in a block can produce a different set (and size) of results. At the moment I am allocating a fixed amount of device memory per thread.

Meaning, for XX threads I have to allocate XX * max_result_count * data_structure * sizeof(int); my data consists of integers. Each thread finds its memory block (offset) by calculating int i = blockDim.x * blockIdx.x + threadIdx.x; and multiplying it by max_result_count * data_structure to index into the integer array. A sketch of this layout follows.
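
A minimal sketch of this fixed-size layout (max_result_count and data_structure are the question's names; the kernel body is a hypothetical placeholder):

__global__ void fixed_size_kernel(int *results, int max_result_count, int data_structure){
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  // each thread owns a fixed-size private slice of the allocation
  int *my_block = results + (size_t)i * max_result_count * data_structure;
  my_block[0] = 42; // most of the slice may stay unused
}

The allocation must cover gridDim.x * blockDim.x * max_result_count * data_structure * sizeof(int) bytes, no matter how few results each thread actually produces.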

In the real world this wastes a huge amount of device memory, because some result sets are nearly empty while others are not. For example, I have to allocate just under 2 GB of device memory to store the equivalent of 300 MB of results.

Any ideas on how to rework this?

For example, each thread could lock a mutex, increment the actual res_count, write its data into the shared memory block, and unlock the mutex.

[Problem solved, thanks, guys!]

Upvotes: 0

Views: 261

Answers (2)

Robert Crovella

Reputation: 152173

You've already hinted in your question at one possible approach:

#define DSIZE (100*1048576)

__device__ unsigned int buffer_index = 0;
__device__ int *buffer_data;

In your host code:

int *buffer_data_temp;
cudaMalloc(&buffer_data_temp, sizeof(int)*DSIZE);
// point the device-side buffer_data pointer at the allocation
cudaMemcpyToSymbol(buffer_data, &buffer_data_temp, sizeof(int *));

In your thread code:

// atomically reserve space for this thread's results; the returned
// old value is the starting offset of the reserved region
unsigned int my_buffer_offset = atomicAdd(&buffer_index, size_of_my_thread_data);
assert((my_buffer_offset+size_of_my_thread_data) < DSIZE);
memcpy(buffer_data+my_buffer_offset, my_thread_data, size_of_my_thread_data*sizeof(int));

(disclaimer: coded in browser, not tested)

It's not necessary to use a mutex, for example around the memcpy operation. Once we have reserved the starting and ending points of our allocation with the atomicAdd, the threads will not step on each other, even if all are writing data, because they are writing to separate regions within buffer_data.

EDIT: Here's a complete example:

#include <stdio.h>
#include <assert.h>
#define DSIZE (100*1048576)
#define nTPB 32
#define BLKS 2

__device__ unsigned int buffer_index = 0;

__global__ void update_buffer(int *buffer_data){
  const unsigned int size_of_my_thread_data = 1;
  // reserve a region in the shared output buffer; the old value of
  // buffer_index is the start of this thread's region
  unsigned int my_buffer_offset = atomicAdd(&buffer_index, size_of_my_thread_data);
  assert((my_buffer_offset+size_of_my_thread_data) < DSIZE);
  int my_thread_data[size_of_my_thread_data];
  my_thread_data[0] = (blockIdx.x*10000) + threadIdx.x;
  // no mutex needed: each thread writes only to its reserved region
  memcpy(buffer_data+my_buffer_offset, my_thread_data, size_of_my_thread_data*sizeof(int));
}

int main(){

  int *h_buffer_data, *d_buffer_data;
  cudaMalloc(&d_buffer_data, sizeof(int)*DSIZE);
  update_buffer<<<BLKS, nTPB>>>(d_buffer_data);
  // the implicit synchronization of cudaMemcpyFromSymbol guarantees the
  // kernel has finished; buffer_index now holds the total result count
  unsigned int result_size;
  cudaMemcpyFromSymbol(&result_size, buffer_index, sizeof(unsigned int));
  h_buffer_data = (int *)malloc(sizeof(int)*result_size);
  cudaMemcpy(h_buffer_data, d_buffer_data, result_size*sizeof(int), cudaMemcpyDeviceToHost);
  for (unsigned int i = 0; i < result_size; i++)
    printf("%d\n", h_buffer_data[i]);
  free(h_buffer_data);
  cudaFree(d_buffer_data);
  return 0;
}
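
Note that buffer_index persists across kernel launches; a minimal sketch of resetting it from the host before a subsequent launch (assuming the same buffer is reused):

unsigned int zero = 0;
// reset the device-side counter so the next launch starts at offset 0
cudaMemcpyToSymbol(buffer_index, &zero, sizeof(unsigned int));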

Upvotes: 1

Avi Ginsburg

Reputation: 10596

Rewrite the kernel and the calling function so that each launch calculates a portion of the required points (obviously, you'll have to change the number of blocks per launch, etc.).

int offset = 0;
for(int i = 0; i < numKernelLaunches; i++) {
    // each launch processes the next numBlocks*threadsPerBlock points
    yourKernel<<<numBlocks,threadsPerBlock>>>(offset, /* your other parameters */);
    offset += numBlocks*threadsPerBlock;
    cudaDeviceSynchronize();
}

and in yourKernel you keep int i = blockDim.x * blockIdx.x + threadIdx.x; as the index for the global memory access, and use i + offset as the ID of your data position. A sketch of such a kernel follows.
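
A minimal sketch of what yourKernel might look like under this scheme (results, max_result_count, and totalPoints are hypothetical names, not from the answer):

__global__ void yourKernel(int offset, int *results, int max_result_count, int totalPoints){
    int i = blockDim.x * blockIdx.x + threadIdx.x;  // index into this launch's global memory slice
    int id = i + offset;                            // ID of the data position across all launches
    if (id < totalPoints){
        // placeholder: compute and store the results for point 'id'
        results[(size_t)i * max_result_count] = id;
    }
}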

Upvotes: 1
