Reputation: 1035
I want to write a CUDA kernel that computes histograms.
Suppose I have the following array:
| 1 | 10 | 30 | 39 | 32 | 2 | 4 | 5 | 1 |
with no value exceeding maxValue
(40 in my example). I want to create a histogram, say using the following 4 buckets:
0 - 9 (1st bucket)
10 - 19 (2nd bucket)
20 - 29 (3rd bucket)
30 - 39 (4th bucket)
Each value maps to bucket value / Bwidth, where Bwidth = maxValue / buckets (10 in my example). I first thought of creating a partial histogram in each block using shared memory (a temp array):
__global__ void histo_kernel_optimized5(unsigned char *buffer, long size, unsigned int *histo)
{
    extern __shared__ unsigned int temp[];
    temp[threadIdx.x] = 0;  // assumes blockDim.x == buckets
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    int bucketID;
    while (i < size) {
        bucketID = buffer[i] / Bwidth;  // Bwidth = maxValue / buckets
        atomicAdd(&temp[bucketID], 1);
        i += offset;
    }
    __syncthreads();

    atomicAdd(&(histo[threadIdx.x]), temp[threadIdx.x]);
}
This is how I invoke my kernel:
histo_kernel_optimized5<<<array_size / buckets, buckets, buckets * sizeof(unsigned int)>>>(buffer, SIZE, histogram);
But compilation fails with:
Instruction '{atom,red}.shared' requires .target sm_12 or higher
Note: My GPU has Compute Capability 1.1.
I also tried having each thread create its own temp array:
__global__ void histo_kernel_optimized5(unsigned char *buffer, long size, unsigned int *histo)
{
    unsigned int temp[buckets];  // fails: buckets is not a compile-time constant
    int j;
    for (j = 0; j < buckets; j++) {
        temp[j] = 0;
    }

    int bucketID;
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size) {
        bucketID = buffer[i] / Bwidth;
        temp[bucketID]++;
        i += offset;
    }

    for (j = 0; j < buckets; j++) {
        histo[j] += temp[j];
    }
}
This does not compile either; apparently temp has to be declared with a constant size. But I want to support setting the number of buckets dynamically (the user should be able to choose it on the command line when invoking my program).
What am I doing wrong? How can I implement this correctly?
Upvotes: 0
Views: 10781
Reputation: 645
There is a solution for devices without atomic operations; it shows an approach that minimizes on-chip memory collisions by subdividing the histogram among warps, proposed by Podlozhnyuk in Histogram calculation in CUDA.
The code is in the CUDA Samples at CUDASamples\3_Imaging\histogram.
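For devices without shared-memory atomics, that sample emulates an atomic increment by tagging the high bits of each counter with the writing lane's ID and retrying until the write sticks. A simplified sketch of the trick (the names and the 5-bit tag layout follow the histogram256 sample from memory, so treat them as illustrative):

// Per-warp sub-histogram update without atomics (sm_1x). Because a warp
// executes in lockstep, exactly one lane's write to s_WarpHist[bin] wins
// each iteration; that lane sees its own tag and exits, the rest retry.
#define TAG_MASK 0x07FFFFFFU                    // low 27 bits: count, top 5: lane tag

__device__ void addData(volatile unsigned int *s_WarpHist,
                        unsigned int bin, unsigned int threadTag)
{
    unsigned int count;
    do {
        count = s_WarpHist[bin] & TAG_MASK;     // strip the previous tag
        count = threadTag | (count + 1);        // increment and re-tag
        s_WarpHist[bin] = count;
    } while (s_WarpHist[bin] != count);         // another lane won: retry
}

Each warp gets its own s_WarpHist in shared memory (with threadTag = threadIdx.x << 27), so collisions only happen within a warp, where lockstep execution guarantees the retry loop terminates.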
Upvotes: 1
Reputation: 96
When using atomics, launching fewer blocks will reduce contention (and hence improve performance) because fewer threads compete to update the same counters. Launch fewer blocks and have each block loop over more of the input elements:
for (unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
     tid < size;
     tid += gridDim.x * blockDim.x) {
    unsigned char value = array[tid]; // borrowing notation from another answer here
    int bin = value % buckets;
    atomicAdd(&histo[bin], 1);
}
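The kernel containing this loop is then launched with a small, fixed grid instead of one thread per element. A sketch (kernel_getHistLoop and the sizing heuristic are illustrative, not part of the answer above):

// Hypothetical launch: size the grid to the hardware (a few blocks per
// multiprocessor) rather than to the input length.
int device;
cudaGetDevice(&device);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);

dim3 block(256);
dim3 grid(2 * prop.multiProcessorCount);        // illustrative heuristic
kernel_getHistLoop<<<grid, block>>>(dArray, size, dHist, buckets);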
Upvotes: 8
Reputation: 16796
A histogram is really easy to implement using atomic operations; I don't know why you are writing such a complex kernel. The motivation for parallelizing the operation is to exploit the parallel nature of the algorithm, and there is no need to iterate over the entire histogram inside the kernel. Here is a sample CUDA kernel and wrapper function that calculate the histogram of an array with a specified number of bins. I don't think it can be optimized much further for Compute 1.1 devices, but for Compute 1.2 shared memory can be utilized (see the sketch after the code).
__global__ void kernel_getHist(unsigned char* array, long size, unsigned int* histo, int buckets)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size) return;

    unsigned char value = array[tid];
    int bin = value % buckets;   // modulo binning; use value / Bwidth for range buckets as in the question
    atomicAdd(&histo[bin], 1);
}

void getHist(unsigned char* array, long size, unsigned int* histo, int buckets)
{
    unsigned char* dArray;
    cudaMalloc((void**)&dArray, size);
    cudaMemcpy(dArray, array, size, cudaMemcpyHostToDevice);

    unsigned int* dHist;
    cudaMalloc((void**)&dHist, buckets * sizeof(unsigned int));
    cudaMemset(dHist, 0, buckets * sizeof(unsigned int));

    dim3 block(32);
    dim3 grid((size + block.x - 1) / block.x);
    kernel_getHist<<<grid, block>>>(dArray, size, dHist, buckets);

    cudaMemcpy(histo, dHist, buckets * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(dArray);
    cudaFree(dHist);
}
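For Compute 1.2 and later, each block can first accumulate into a dynamically sized shared-memory histogram and merge it into the global one at the end, which also sidesteps the constant-size array problem from the question. A minimal sketch, assuming the kernel is launched with buckets * sizeof(unsigned int) bytes of dynamic shared memory:

// Per-block partial histogram in dynamic shared memory (needs shared-memory
// atomics, i.e. Compute 1.2+). Launch as:
//   kernel_getHistShared<<<grid, block, buckets * sizeof(unsigned int)>>>(...);
__global__ void kernel_getHistShared(unsigned char* array, long size,
                                     unsigned int* histo, int buckets)
{
    extern __shared__ unsigned int temp[];

    // Zero the per-block histogram (handles buckets != blockDim.x).
    for (int j = threadIdx.x; j < buckets; j += blockDim.x)
        temp[j] = 0;
    __syncthreads();

    // Grid-stride loop: accumulate into shared memory.
    for (long i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
         i += (long)blockDim.x * gridDim.x)
        atomicAdd(&temp[array[i] % buckets], 1);
    __syncthreads();

    // Merge this block's partial histogram into the global one.
    for (int j = threadIdx.x; j < buckets; j += blockDim.x)
        atomicAdd(&histo[j], temp[j]);
}

Because the extern __shared__ array is sized at launch time, the bucket count can come straight from the command line.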
Upvotes: 4