obachtos

Reputation: 1061

Binning of data along one axis in numpy

I have a large two-dimensional array arr which I would like to bin along the second axis using numpy. Because np.histogram flattens the array, I'm currently using a for loop:

import numpy as np

arr = np.random.randn(100, 100)

nbins = 10
binned = np.empty((arr.shape[0], nbins))

for i in range(arr.shape[0]):
    binned[i,:] = np.histogram(arr[i,:], bins=nbins)[0]

I feel like there should be a more direct and more efficient way to do that within numpy but I failed to find one.

Upvotes: 18

Views: 9123

Answers (5)

Axiomel

Reputation: 215

For a large number of small data series, I think numpy.digitize can be a lot faster. Here is an example with 5000 data series of 50 data points each, binned into 10 bins that share the same edges for every series. The speedup in this case is about an order of magnitude compared to the np.apply_along_axis implementation. The implementation looks like:

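def histograms(data, bin_edges):
    # Bin index of every element; in-range values get 1..len(bin_edges)-1
    indices = np.digitize(data, bin_edges)
    counts = np.zeros((data.shape[0], len(bin_edges)-1))
    # Loop over every bin index rather than np.unique(indices), so that
    # empty bins and out-of-range values cannot shift the columns
    for i in range(len(bin_edges)-1):
        counts[:, i] = np.sum(indices == i+1, axis=1)
    return counts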

And here are some timings and verification:

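import time
import numpy as np

data = np.random.rand(5000, 50)
bin_edges = np.linspace(0, 1, 11)
n_runs = 10  # repeat each measurement so the division below is a per-call average

t1 = time.perf_counter()
for _ in range(n_runs):
    h1 = histograms(data, bin_edges)
t2 = time.perf_counter()
print('digitize ', 1000*(t2-t1)/n_runs, 'ms')

t1 = time.perf_counter()
for _ in range(n_runs):
    h2 = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_edges)[0], 1, data)
t2 = time.perf_counter()
print('numpy    ', 1000*(t2-t1)/n_runs, 'ms')

# Both approaches must give identical counts
assert np.allclose(h1, h2)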

The result is something like this:

digitize  1.690 ms
numpy     15.08 ms
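
The remaining Python loop over bins can also be removed. The following is a minimal sketch, not part of the timings above: it assumes the same bin edges for every row, and the helper name histograms_bincount is made up for illustration. It shifts each row's bin indices into its own range so that a single np.bincount call counts all rows at once.

```python
import numpy as np

def histograms_bincount(data, bin_edges):
    """Row-wise histogram counts with shared bin edges (illustrative helper)."""
    n_rows = data.shape[0]
    n_bins = len(bin_edges) - 1
    # digitize gives 0 for values below the first edge and n_bins + 1 for
    # values at or above the last edge, so each row needs n_bins + 2 slots.
    width = n_bins + 2
    idx = np.digitize(data, bin_edges)
    # Offset the indices of row r by r * width so rows do not collide.
    flat = idx + width * np.arange(n_rows)[:, None]
    counts = np.bincount(flat.ravel(), minlength=n_rows * width)
    # Drop the underflow and overflow slots of every row.
    return counts.reshape(n_rows, width)[:, 1:-1]

rng = np.random.default_rng(0)
data = rng.random((5000, 50))
bin_edges = np.linspace(0, 1, 11)
h = histograms_bincount(data, bin_edges)
```

Unlike np.histogram, this sketch counts a value exactly equal to the last edge as overflow, so the two can differ for such values.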

Cheers.

Upvotes: 4

Adrien Mau

Reputation: 326

To bin a numpy array along any axis by averaging groups of bin_n adjacent elements, you may use:

def bin_nd_data(arr, bin_n=2, axis=-1):
    """Average groups of bin_n adjacent elements of an nD array along one axis."""
    axis = axis % arr.ndim  # normalize negative axes such as -1
    ss = list(arr.shape)
    if ss[axis] % bin_n != 0:
        raise ValueError('axis length %d is not divisible by bin_n=%d'
                         % (ss[axis], bin_n))
    ss[axis] //= bin_n
    # Split the axis into (n_groups, bin_n); the default C order keeps
    # adjacent elements in the same group
    ss.insert(axis + 1, bin_n)
    # Average over the new bin_n axis
    return np.mean(np.reshape(arr, ss), axis=axis + 1)

Normalizing the axis first removes the need to special-case 'axis=-1'.
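
The reshape-and-average idea is easiest to see on a small array. A minimal sketch, assuming a C-ordered array and groups of two adjacent elements along the last axis (the array and group size are arbitrary choices for illustration):

```python
import numpy as np

arr = np.arange(12).reshape(2, 6)
# Split the last axis of length 6 into 3 groups of 2 adjacent elements,
# then average within each group.
binned = arr.reshape(2, 3, 2).mean(axis=-1)
print(binned)
# [[ 0.5  2.5  4.5]
#  [ 6.5  8.5 10.5]]
```

Note that this averages values rather than counting them, so it is a different kind of binning from np.histogram.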

Upvotes: 0

ThomasNicholas

Reputation: 1382

I was a bit confused by the lambda in Ami's solution so I expanded it out to show what it's doing:

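def hist_1d(a):
    # Histogram counts for a single row; nbins as defined in Ami's answer
    return np.histogram(a, bins=nbins)[0]

# Apply hist_1d to every row (axis=1) of x, the array from Ami's answer
counts = np.apply_along_axis(hist_1d, axis=1, arr=x)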

Upvotes: 2

Arpan Das

Reputation: 339

You have to use numpy.histogramdd, which is specifically meant for your problem.
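
np.histogramdd does not bin along an axis directly, because it treats each row of its input as one D-dimensional point. A minimal sketch of one way it could still be applied here, assuming shared bin edges for all rows: pair every value with its row index and build a 2D histogram over (row, value). The helper name rowwise_hist_dd is hypothetical.

```python
import numpy as np

def rowwise_hist_dd(data, bin_edges):
    """Per-row histogram counts via np.histogramdd (illustrative helper)."""
    n_rows, n_cols = data.shape
    # Pair every value with the index of the row it came from.
    row_ids = np.repeat(np.arange(n_rows), n_cols)
    sample = np.column_stack([row_ids, data.ravel()])
    # One unit-wide bin per row along the first dimension.
    row_edges = np.arange(n_rows + 1) - 0.5
    counts, _ = np.histogramdd(sample, bins=[row_edges, bin_edges])
    return counts

rng = np.random.default_rng(0)
data = rng.random((20, 30))
bin_edges = np.linspace(0, 1, 6)
counts = rowwise_hist_dd(data, bin_edges)
```

The result has one row of counts per input row, returned as floats.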

Upvotes: -7

Ami Tavory

Reputation: 76297

You could use np.apply_along_axis:

>>> import numpy as np
>>> x = np.array([range(20), range(1, 21), range(2, 22)])
>>> nbins = 2
>>> np.apply_along_axis(lambda a: np.histogram(a, bins=nbins)[0], 1, x)
array([[10, 10],
       [10, 10],
       [10, 10]])

The main advantage (if any) is that it's slightly shorter, but I wouldn't expect much of a performance gain. It's possibly marginally more efficient in the assembly of the per-row results.
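
A quick way to check that this gives the same counts as the explicit loop from the question. This is a minimal sketch; the seed and array size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal((100, 100))
nbins = 10

# Explicit loop, as in the question.
looped = np.empty((arr.shape[0], nbins))
for i in range(arr.shape[0]):
    looped[i, :] = np.histogram(arr[i, :], bins=nbins)[0]

# Same computation through np.apply_along_axis.
applied = np.apply_along_axis(lambda a: np.histogram(a, bins=nbins)[0], 1, arr)

same = np.array_equal(looped, applied)
```

With an integer bins argument, each row gets its own bin edges spanning that row's minimum and maximum, in both versions.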

Upvotes: 18
