Reputation: 6950
I'm following a Udacity problem set lesson to compute a histogram of numBins elements out of a long series of numElems values. In this simple case each element's value is also its own bin in the histogram, so generating the histogram with CPU code is as simple as
    for (int i = 0; i < numElems; ++i)
        histo[val[i]]++;
I don't understand the video's explanation of a "fast histogram computation", according to which I should first sort the values by a 'coarse bin id' and then compute the final histogram.
The question is: why should I sort the values by 'coarse bin indices'? So I will have to sort all the values first? Isn't that more expensive than reading and doing an atomicAdd into the right bin?
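To be concrete, the straightforward GPU version I have in mind (just a sketch under my reading of the lesson, not code from the problem set; the kernel and parameter names are mine) does one global atomicAdd per input element:

    // Naive GPU histogram sketch: one global atomicAdd per input element.
    __global__ void naive_histo(const unsigned int *vals, unsigned int *histo, int numElems)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numElems)
            atomicAdd(&histo[vals[i]], 1u);
    }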
Upvotes: 1
Views: 1698
Reputation: 193
I think the idea of using a partial radix sort as in Robert Crovella's answer is correct. However, there seems to be some mismatch between the explanation in the problem video and the sample code, which makes this strategy inapplicable here.
The main issue is that the sample code uses 10000 * 1024 samples in the range [0, 1024). By the pigeonhole principle, this means at least one of the final histogram bins will have 10000 elements in it (since the bins have width 1). Coarse bins are wider than these, so the same is true for them. Therefore, you'll always have at least one coarse bin that won't fit into a CUDA block (in practice, you'll have many).
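To spell out the pigeonhole arithmetic (using the figures quoted above):

    10000 * 1024 samples / 1024 distinct values = 10000 samples in the fullest bin (at least)

which is roughly ten times the 1024 threads a single block can run, assuming the one-coarse-bin-per-block scheme gives one element to each thread.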
Note also that the sample code doesn't say anything about coarse histograms. Maybe there was some miscommunication between whoever designed the problems and the TA who made the video.
Nevertheless, in the case where the pigeonhole argument above doesn't apply (when the sample size is sufficiently small and/or the range of values is sufficiently large), the idea of using coarse histograms is an interesting one. The main thing I would add to Robert Crovella's answer (in addition to the use of the partial radix sort) is that you have to make use of the distribution of the input values (this is actually mentioned in the instructions in the sample code). Basically, you need the coarse histogram bins to be narrow enough that the probability that the number of samples falling into a bin exceeds the size of a block is practically 0.
For the given sample code, the samples are normally distributed. The mean is a random value between 462 and 562 and the standard deviation is 100. For a given sample size, you can calculate the probability that a single sample falls into a particular coarse bin. You only need to guarantee that not too many samples fall into the coarse bin near the center of the range of values, because that's where most of the normally distributed samples end up. If you're doing a partial radix sort on the n most significant bits, each coarse bin covers a 2^(-n) fraction of the value range, and this probability can be computed using a scientific computation library like scipy or boost.math.
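As a rough sketch of that first calculation using boost.math's normal distribution (the mean, bit count, and bin edges below are illustrative assumptions of mine, not values fixed by the problem set):

    #include <boost/math/distributions/normal.hpp>
    #include <iostream>

    int main()
    {
        // Illustrative assumptions: mean 512, sigma 100, values in [0, 1024),
        // partial radix sort on the n = 3 most significant bits
        // -> 8 coarse bins, each 128 wide; the central one is [512, 640).
        const double mean = 512.0, sigma = 100.0;
        const double bin_lo = 512.0, bin_hi = 640.0;

        boost::math::normal_distribution<double> dist(mean, sigma);

        // Probability that a single sample lands in the central coarse bin.
        const double p = boost::math::cdf(dist, bin_hi) - boost::math::cdf(dist, bin_lo);
        std::cout << "p(sample in central coarse bin) = " << p << "\n";
        return 0;
    }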
Since the samples are independent, the number of samples falling into this coarse bin will be binomially distributed, with the number of trials given by the total number of samples and the success probability given by the per-sample probability computed above. Once again, you can use a scientific computation package to compute the probability that the number of samples falling into the central coarse bin is greater than the block size you've chosen. You should observe that as n (the number of bits sorted on, and hence the number of coarse bins) is increased, this probability approaches zero.
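Continuing the sketch with boost.math's binomial distribution (the sample count, per-sample probability, and per-block budget below are placeholders I chose; the per-sample probability would come from a calculation like the one above and shrinks as more bits are sorted on):

    #include <boost/math/distributions/binomial.hpp>
    #include <iostream>

    int main()
    {
        // Placeholder values: total sample count, per-sample probability for the
        // central coarse bin (smaller when more bits are sorted on), and the
        // number of elements one block is allowed to handle.
        const double num_samples    = 65536.0;
        const double p_central_bin  = 0.01;
        const double block_capacity = 1024.0;

        boost::math::binomial_distribution<double> dist(num_samples, p_central_bin);

        // P(count in the central coarse bin > block_capacity)
        const double p_overflow = boost::math::cdf(boost::math::complement(dist, block_capacity));
        std::cout << "P(central coarse bin overflows a block) = " << p_overflow << "\n";
        return 0;
    }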
Upvotes: 1
Reputation: 151809
why should I sort the values by 'coarse bin indices'?
This is an attempt to break down the work into pieces that can be handled by a single threadblock. There are several considerations here:
- Shared memory per threadblock is limited. A histogram over 4 decimal digits would have 10000 bins, each of which I will assume requires an int value, so that is 40Kbytes, which would just barely fit into shared memory (and might have negative performance implications as an occupancy limiter). A histogram over 5 decimal digits probably would not fit. On the other hand, with a "coarse bin sort" of a single decimal digit, I could reduce the per-block shared memory requirement from 40Kbytes to 4Kbytes (approximately).
- Shared memory atomics are often considerably faster than global memory atomics, so breaking down the work this way allows for efficient use of shared memory atomics, which may be a useful optimization.
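To make the shared-memory-atomics point concrete, here is a minimal sketch (my own illustration, not the course solution; NUM_BINS and the launch configuration are assumptions) of a kernel that accumulates a block-private histogram in shared memory and then flushes it with one global atomicAdd per bin:

    #define NUM_BINS 1024   // illustrative; small enough for ~4KB of shared memory

    __global__ void histo_shared(const unsigned int *vals, unsigned int *histo, int numElems)
    {
        __shared__ unsigned int local[NUM_BINS];

        // Zero the block-private histogram.
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            local[b] = 0;
        __syncthreads();

        // Grid-stride loop over the input, binning with fast shared-memory atomics.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numElems; i += gridDim.x * blockDim.x)
            atomicAdd(&local[vals[i]], 1u);
        __syncthreads();

        // One global atomic per bin per block, instead of one per element.
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            atomicAdd(&histo[b], local[b]);
    }

This particular sketch keeps the whole (small) histogram in each block; the coarse-bin idea is what lets you shrink the per-block bin count when the full histogram would not fit in shared memory.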
so I will have to sort all the values first? Isn't that more expensive than reading and doing an atomicAdd into the right bin?
Maybe. But the idea of a coarse bin sort is that it may be computationally much less expensive than a full sort. A radix sort is a commonly used, relatively fast sorting operation that can be done in parallel on a GPU. Radix sort has the characteristic that the sorting operation begins with the most significant "digit" and proceeds iteratively to the least significant digit. However, a coarse bin sort implies that only some subset of the most significant digits actually need to be "sorted". Therefore, a "coarse bin sort" using a radix sort technique can be substantially less expensive than a full sort. If you sort only on the most significant digit out of 3 digits, as indicated in the Udacity example, your sort is only approximately 1/3 as expensive as a full sort.
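As a sketch of what a single-digit (coarse) pass looks like, here is a host-side version (my own illustration with assumed bit widths, not the course code; on the GPU the count and offset steps would be a parallel histogram plus an exclusive scan):

    #include <vector>
    #include <cstdint>

    std::vector<uint32_t> coarse_bin_sort(const std::vector<uint32_t> &vals)
    {
        const int VALUE_BITS  = 10;               // values assumed in [0, 1024)
        const int COARSE_BITS = 3;                // sort on the top 3 bits -> 8 coarse bins
        const int SHIFT       = VALUE_BITS - COARSE_BITS;
        const int NUM_COARSE  = 1 << COARSE_BITS;

        // 1. Count how many values fall into each coarse bin.
        std::vector<size_t> count(NUM_COARSE, 0);
        for (uint32_t v : vals)
            ++count[v >> SHIFT];

        // 2. Exclusive scan of the counts gives each coarse bin's starting offset.
        std::vector<size_t> offset(NUM_COARSE, 0);
        for (int b = 1; b < NUM_COARSE; ++b)
            offset[b] = offset[b - 1] + count[b - 1];

        // 3. Scatter: all values sharing a coarse bin end up contiguous.
        std::vector<uint32_t> sorted(vals.size());
        for (uint32_t v : vals)
            sorted[offset[v >> SHIFT]++] = v;

        return sorted;
    }

After this one pass, all values sharing a coarse bin are contiguous, which is all the per-block histogramming needs; the values inside each coarse bin stay unsorted, which is why the pass is cheaper than a full sort.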
I'm not suggesting that this is a guaranteed recipe for faster performance in every case. The specifics matter (e.g. size of the histogram, range of values, final number of bins, etc.). The specific GPU you use may also affect the tradeoff. For example, Kepler and newer devices have substantially improved global memory atomics, so the comparison will be substantially impacted by that. (OTOH, Pascal has substantially improved shared memory atomics, which will once again shift the comparison in the other direction.)
Upvotes: 2