Dan
Dan

Reputation: 33

Is there a better way of implementing a histogram?

I have a 2D uint16 numpy array, I want to calculate the histogram for this array. The function I use is:

def calc_hist(source):
    hist = np.zeros(2**16, dtype='uint16')
    for i in range(source.shape[0]):
       for j in range(source.shape[1]):
           hist[source[i, j]] = hist[source[i, j]] + 1
    return hist 

This function takes too much time to execute. As I understand, there's a histogram function in the numpy module but I cant figure how to use it. I've tried:

hist,_ = np.histogram(source.flatten(), bins=range(2**16))

But I get different results then my own function. How can I call numpy.histogram to achieve the same result? or is there any other options?

Upvotes: 1

Views: 744

Answers (2)

Warren Weckesser
Warren Weckesser

Reputation: 114801

For an input with data type uint16, numpy.bincount should work well:

hist = np.bincount(source.ravel(), minlength=2**16)

Your function is doing almost exactly what bincount does, but bincount is implemented in C.

For example, the following checks that this use of bincount gives the same result as your calc_hist function:

In [159]: rng = np.random.default_rng()

In [160]: x = rng.integers(0, 2**16, size=(1000, 1000))                                      

In [161]: h1 = calc_hist(x)

In [162]: h2 = np.bincount(x.ravel(), minlength=2**16)

In [163]: (h1 == h2).all()  # Verify that they are the same.
Out[163]: True

Check the performance with ipython %timeit command. You can see that using bincount is much faster.

In [164]: %timeit calc_hist(x)
2.66 s ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [165]: %timeit np.bincount(x.ravel(), minlength=2**16)
3.13 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 4

Martin Langbecker
Martin Langbecker

Reputation: 41

As Corley Brigman pointed out, passing bins=range(x) determines the bin edges [1]. So, you will end up with x-1 bins with corresponding edges [0, 1), [1, 2), ..., [x-1, x].

In your case, you will have 2^16-1 bins. To fix it, simply use range(2**16+1).

[1] https://numpy.org/doc/stable/reference/generated/numpy.histogram.html?highlight=histogram#numpy.histogram

Upvotes: 1

Related Questions