Reputation: 33
I have a 2D uint16 numpy array, I want to calculate the histogram for this array. The function I use is:
def calc_hist(source):
hist = np.zeros(2**16, dtype='uint16')
for i in range(source.shape[0]):
for j in range(source.shape[1]):
hist[source[i, j]] = hist[source[i, j]] + 1
return hist
This function takes too much time to execute. As I understand, there's a histogram function in the numpy module but I cant figure how to use it. I've tried:
hist,_ = np.histogram(source.flatten(), bins=range(2**16))
But I get different results then my own function. How can I call numpy.histogram to achieve the same result? or is there any other options?
Upvotes: 1
Views: 744
Reputation: 114801
For an input with data type uint16
, numpy.bincount
should work well:
hist = np.bincount(source.ravel(), minlength=2**16)
Your function is doing almost exactly what bincount
does, but bincount
is implemented in C.
For example, the following checks that this use of bincount
gives the same result as your calc_hist
function:
In [159]: rng = np.random.default_rng()
In [160]: x = rng.integers(0, 2**16, size=(1000, 1000))
In [161]: h1 = calc_hist(x)
In [162]: h2 = np.bincount(x.ravel(), minlength=2**16)
In [163]: (h1 == h2).all() # Verify that they are the same.
Out[163]: True
Check the performance with ipython %timeit
command. You can see that using bincount
is much faster.
In [164]: %timeit calc_hist(x)
2.66 s ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [165]: %timeit np.bincount(x.ravel(), minlength=2**16)
3.13 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 4
Reputation: 41
As Corley Brigman pointed out, passing bins=range(x)
determines the bin edges [1]. So, you will end up with x-1 bins with corresponding edges [0, 1), [1, 2), ..., [x-1, x].
In your case, you will have 2^16-1 bins. To fix it, simply use range(2**16+1)
.
Upvotes: 1