Reputation: 291
I am using the code presented in Chapter 14 of the OpenCL Programming Guide to calculate a histogram. It works fine for 256 bins, but my application needs 65536 bins. With this approach, the local array becomes too big:
local uint tmp_histogram[256 * 256];
As a result, the program fails to build (CL_BUILD_PROGRAM_FAILURE).
Do you have any ideas how this issue can be solved? I thought of using multiple kernels to compute the values for the different bins (i.e. to split the histogram, so that I first compute the values for the bins 0-255, then for 256-511, etc.). However, in this case I will have to check if a value is within that range before incrementing, which means that I will need conditionals...
Upvotes: 0
Views: 1404
Reputation: 5087
Using global memory would solve the problem, but would not result in a very fast kernel. I suggest creating multiple work groups, and using each group to count a range of values only.
#define RANGE_SIZE 8192
kernel void histo(__global const uint *data, int dataSize){
    int wid = get_local_id(0);
    int wSize = get_local_size(0);
    int gid = get_group_id(0);
    // each work group counts the values in [rangeStart, rangeEnd)
    uint rangeStart = (uint)gid * RANGE_SIZE;
    uint rangeEnd = rangeStart + RANGE_SIZE;
    local uint tmp_histogram[RANGE_SIZE];
    // zero the local bins before counting
    for(int i=wid; i< RANGE_SIZE; i+= wSize){
        tmp_histogram[i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    uint value;
    // every work item strides over the whole input
    for(int i=wid; i< dataSize; i+= wSize){
        value = data[i];
        if(value >= rangeStart && value < rangeEnd){
            atomic_inc(&tmp_histogram[value - rangeStart]);
        }
    }
    //barrier...
    //use the local data here
}
This assumes 32 KB of local memory is available (8192 bins of 4 bytes each). If you reduce RANGE_SIZE, it does not have to be a power of two, but you do need to launch the kernel with enough work groups to cover all values up to 64k.
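To check the kernel's output, a plain C reference that follows the same range-splitting idea on the CPU can help. This is only a sketch: `count_range` and `histogram_by_ranges` are hypothetical helper names, and it assumes each range covers RANGE_SIZE consecutive bins and that values of 65536 or more are simply ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_BINS   65536u  /* total bins required by the question */
#define RANGE_SIZE 8192u   /* bins per range, same value as in the kernel */

/* Counts the values that fall in [range_start, range_start + RANGE_SIZE),
 * mimicking what one work group does with its local array. */
static void count_range(const uint32_t *data, size_t n,
                        uint32_t range_start, uint32_t *bins)
{
    memset(bins, 0, RANGE_SIZE * sizeof bins[0]);
    for (size_t i = 0; i < n; ++i) {
        uint32_t v = data[i];
        if (v >= range_start && v - range_start < RANGE_SIZE)
            bins[v - range_start]++;
    }
}

/* Builds the full histogram with one pass over the data per range.
 * hist must have room for NUM_BINS counters. */
void histogram_by_ranges(const uint32_t *data, size_t n, uint32_t *hist)
{
    assert(hist != NULL);
    for (uint32_t g = 0; g < NUM_BINS / RANGE_SIZE; ++g)
        count_range(data, n, g * RANGE_SIZE, hist + (size_t)g * RANGE_SIZE);
}
```

Each range reads the whole input once, so the total work grows with the number of ranges. That is the price of keeping the counters in a small, fast array.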
Upvotes: 2
Reputation: 395
Move your histogram to global storage. Another option is to use unsigned short for the bins, if the counts in your application fit in that type. Finally, you could run your code twice: the first time for the lower 32768 values, the second time for the upper half.
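How far these options go depends on how much local memory the device offers. A rough sketch of the arithmetic in C follows; `histogram_bytes` and `passes_needed` are hypothetical helpers, and the local memory size is something you would have to query from your device (for example via CL_DEVICE_LOCAL_MEM_SIZE).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BINS 65536u  /* total bins required by the question */

/* Bytes needed for a full histogram with counters of the given width. */
size_t histogram_bytes(size_t counter_size)
{
    return (size_t)NUM_BINS * counter_size;
}

/* Number of passes needed if each pass keeps as many bins in local
 * memory as fit into local_bytes. */
size_t passes_needed(size_t local_bytes, size_t counter_size)
{
    size_t bins_per_pass = local_bytes / counter_size;
    assert(bins_per_pass > 0);
    return (NUM_BINS + bins_per_pass - 1) / bins_per_pass;  /* round up */
}
```

With uint counters the full histogram takes 262144 bytes, and with unsigned short counters 131072 bytes. Assuming 32 KB of local memory, that works out to 8 passes for uint and 4 for unsigned short, so two passes would only be enough with more local memory or with global storage. Note also that an unsigned short bin wraps around once it has counted more than 65535 values.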
Upvotes: 1