Reputation: 13614
I am trying to allocate a CUDA global memory array. My summation kernel is:
__device__ float R_d = 0;

__global__ void perform_summation(float* A, int N){
    int idx = blockDim.x*blockIdx.x+threadIdx.x;
    extern __shared__ float sharedArray[];
    float result[]; //THIS IS THE THING I TRIED TO CREATE

    if(idx < N){
        sharedArray[threadIdx.x] = A[idx];
        // }else{
        //     sharedArray[threadIdx.x] = 0;
        // }
        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();
            if(threadIdx.x % (2*stride) == 0){
                sharedArray[threadIdx.x] += sharedArray[threadIdx.x+stride];
            }
        }
    }

    if(idx % blockDim.x == 0){
        // R_d += sharedArray[threadIdx.x];
        result[blockIdx.x] = sharedArray[threadIdx.x];
    }

    for (int i = 0; i < gridDim.x; ++i) {
        R_d += result[i];
    }
}
To summarize, my kernel takes an array and sums its elements with a map-reduce approach. Each block loads its portion of the elements into shared memory, sums them, and writes the block's result into the global array I am trying to create. At the end, I sum all the values in that global array to get the final answer.

In my first attempt I did not use a global array to collect the per-block results; I just added each block's result into the variable R_d, but that does not work: the final value only reflects the last block. I guess that, since there is no synchronization between blocks, the last block overwrites all the earlier values. Here is what I did at the end of the kernel in that first attempt:
if(idx < N){
    sharedArray[threadIdx.x] = A[idx];
    // }else{
    //     sharedArray[threadIdx.x] = 0;
    // }
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if(threadIdx.x % (2*stride) == 0){
            sharedArray[threadIdx.x] += sharedArray[threadIdx.x+stride];
        }
    }
    if(threadIdx.x == 0){
        R_d += sharedArray[threadIdx.x];
    }
}
So I actually have two questions. How can I define a global memory array in device memory for the first solution I propose, and is there any fix for the second solution that uses only the R_d variable?
Upvotes: 0
Views: 3738
Reputation: 5137
You can allocate an array in global device memory with cudaMalloc:
cudaMalloc((void **)&ptr, size);
but you don't want to do that inside the kernel; do it before you launch the kernel and pass the pointer in.
As for the reduction, take a look at these NVIDIA slides, they explain it well. Basically, it depends on how many blocks and threads you use. Let's say there are several blocks. First, define an array in shared memory:
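For example, a minimal host-side sketch might look like the following (the names d_A, d_result, and h_A are illustrative, not from the original kernel):

```cuda
// Hypothetical host-side setup: allocate device memory before the launch
// and pass the pointers into the kernel.
float *d_A, *d_result;
int N = 1 << 20;
int threads = 256;
int blocks = (N + threads - 1) / threads;

cudaMalloc((void **)&d_A, N * sizeof(float));
cudaMalloc((void **)&d_result, sizeof(float));
cudaMemset(d_result, 0, sizeof(float));          // zero the accumulator
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);

// Third launch parameter = dynamic shared memory bytes per block,
// needed because the kernel declares extern __shared__ float sharedArray[];
perform_summation<<<blocks, threads, threads * sizeof(float)>>>(d_A, N);

float result;
cudaMemcpy(&result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
```

Note that cudaMalloc and cudaMemcpy are host API calls, which is why the allocation has to happen outside the kernel.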
__shared__ float cache[BLOCK_THREADS];
Shared memory is allocated per block, so we sum each block's values into the first element of cache.
__syncthreads();

int step = (BLOCK_THREADS >> 1); // same result as BLOCK_THREADS/2
while(step > 0) {
    if (threadInBlock < step) {
        cache[threadInBlock] += cache[threadInBlock + step];
    }
    __syncthreads();
    step = (step >> 1);
}
So this sums all the elements in each block into cache[0]. Now we could apply the reduction again, or we can simply accumulate the per-block sums with an atomic operation. That is fine as long as there are significantly fewer blocks than threads per block.
__syncthreads();

if (threadInBlock == 0) {
    atomicAdd(result, cache[0]);
}
Note that result is a pointer to a single value in global memory. Also note that this will only work if BLOCK_THREADS is a power of 2, which is quite common because the number of threads per block should be a multiple of 32 (aligned to warps).
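Putting the pieces together, a complete kernel along these lines could be sketched as follows (identifiers such as sum_kernel are mine; *result must be zeroed, e.g. with cudaMemset, before the launch):

```cuda
#define BLOCK_THREADS 256

__global__ void sum_kernel(const float *A, int N, float *result) {
    __shared__ float cache[BLOCK_THREADS];
    int threadInBlock = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Pad with zeros so out-of-range threads do not corrupt the sum
    cache[threadInBlock] = (idx < N) ? A[idx] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step
    for (int step = BLOCK_THREADS >> 1; step > 0; step >>= 1) {
        if (threadInBlock < step) {
            cache[threadInBlock] += cache[threadInBlock + step];
        }
        __syncthreads();
    }

    // One atomicAdd per block combines the partial block sums
    if (threadInBlock == 0) {
        atomicAdd(result, cache[0]);
    }
}
```

This avoids both problems in the question: no global array needs to be declared inside the kernel, and the cross-block race on a single accumulator is resolved by the atomic add.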
Upvotes: 1