Reputation: 81
I am trying to write a Cuda kernel to generate row-wise histogram based on the input feature set (2 x 6) where each feature row (each having 6 features) is to generate a histogram having nbins=10. I have implemented the below code but it doesn’t seem to generate the correct row-wise histogram.
import numba
import numpy as np
from numba import cuda
np.random.seed(0)
feature = np.random.randint(1, high=6, size=(2,6), dtype=int)
output = np.zeros(20).astype(np.float32).reshape(2,10)
### Kernal Configuration
threads_per_block = 6
blocks = 2
# moving data to device
d_feature = cuda.to_device(feature)
d_output = cuda.to_device(output)
feature_size = d_feature.shape[1]
@cuda.jit
def row_wise_histogram(feature, output, n):
xmin = np.float32(-4.0)
xmax = np.float32(4.0)
idx = cuda.grid(1)
nbins = 10
bin_width = (xmax - xmin) / nbins
for i in range(n):
# Each thread will take all the row features to generate historgram
input = feature[idx][i]
bin_number = np.int32(nbins * (np.float32(input) - np.float32(xmin)) / (np.float32(xmax) - np.float32(xmin)))
if bin_number >= 0 and bin_number < output.shape[1]:
cuda.atomic.add(output[idx], bin_number, 1)
row_wise_histogram[blocks, threads_per_block](d_feature, d_output, feature_size)
print(d_output.copy_to_host())
And the out results in
[[ 0. 0. 0. 0. 0. 0. 81111. 81111. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 162222. 0. 81111. 0.]]
which is wrong, Will appreciate it if I can get help with the issue inside the row_wise_historgram function!
Upvotes: 0
Views: 236
Reputation: 151809
I think the main issue you have in your code is that your kernel has a thread strategy to have each thread process a row, and you have 2 rows in your feature
dataset, but you are launching 12 threads total:
### Kernal Configuration
threads_per_block = 6
blocks = 2
10 of those threads will be indexing out-of-bounds. For 2 rows you only need 2 threads. We can fix this multiple ways, but I will add a "thread-check" to your kernel code, to prevent out-of-bounds threads from doing anything.
You are also histogramming values that don't fit in your output array. Let's suppose your feature
has a input value of 4 at some location. Let's put that value through your arithmetic:
bin_number = np.int32(nbins * (np.float32(4) - np.float32(-4)) / (np.float32(4) - np.float32(-4)))
That is 10 * (4-(-4))/(4-(-4))
So that is a bin index of 10. But you only have 10 bins, so valid bin index can only go up to 9. Which means some of your input values (e.g. 4, 5) will not be recorded in your output.
The following code is your code to add the threadcheck, plus the range of input adjusted. And I am printing out the input, the bin each input value was assigned, and the output bins. It seems to be working correctly.
$ cat t65.py
import numba
import numpy as np
from numba import cuda
np.random.seed(0)
feature = np.random.randint(1, high=4, size=(2,6), dtype=int)
output = np.zeros(20).astype(np.float32).reshape(2,10)
mybin = np.empty_like(feature)
### Kernal Configuration
threads_per_block = 6
blocks = 2
# moving data to device
d_feature = cuda.to_device(feature)
d_output = cuda.to_device(output)
feature_size = d_feature.shape[1]
d_mybin = cuda.to_device(mybin)
@cuda.jit
def row_wise_histogram(feature, output, mybin, n):
xmin = np.float32(-4.0)
xmax = np.float32(4.0)
idx = cuda.grid(1)
nbins = 10
bin_width = (xmax - xmin) / nbins
if idx < output.shape[0]:
for i in range(n):
# Each thread will take all the row features to generate historgram
input = feature[idx][i]
bin_number = np.int32(nbins * (np.float32(input) - np.float32(xmin)) / (np.float32(xmax) - np.float32(xmin)))
mybin[idx][i] = bin_number
if bin_number >= 0 and bin_number < output.shape[1]:
cuda.atomic.add(output[idx], bin_number, 1)
row_wise_histogram[blocks, threads_per_block](d_feature, d_output, d_mybin, feature_size)
print(feature)
print(d_mybin.copy_to_host())
print(d_output.copy_to_host())
$ python t65.py
[[1 2 1 2 2 3]
[1 3 1 1 1 3]]
[[6 7 6 7 7 8]
[6 8 6 6 6 8]]
[[ 0. 0. 0. 0. 0. 0. 2. 3. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 4. 0. 2. 0.]]
$ cuda-memcheck python t65.py
========= CUDA-MEMCHECK
[[1 2 1 2 2 3]
[1 3 1 1 1 3]]
[[6 7 6 7 7 8]
[6 8 6 6 6 8]]
[[ 0. 0. 0. 0. 0. 0. 2. 3. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 4. 0. 2. 0.]]
========= ERROR SUMMARY: 0 errors
$
Note that when I restrict the input values to 1..3, then the maximum bin index is 8 (do the math). If I increase the input range to include 4, the maximum bin index goes to 10, which "won't fit". You're correctly handling this case, but it may confuse you, as these values of 4 or 5 won't be recorded in the output. Histogram bin arithmetic is fun. You will need to work out exactly what you want.
Also note that if you run this code, you should see output almost exactly the same as above. If you don't, there is a good chance your numba or cuda install is broken somehow, and the additional run I show with cuda-memcheck
will help to discover what may be the issue.
Note that since you are using atomics anyway, there isn't any particular need to assign one thread to each row, you could instead assign one thread to each input point. But that isn't your question; it's a story for another day. Conversely, if you do proceed with one thread per row, each thread doing effectively a private histogram, there is no particular need to use atomics.
Upvotes: 2