Reputation: 1061
I have a large two-dimensional array arr which I would like to bin over the second axis using NumPy. Because np.histogram flattens the array, I'm currently using a for loop:
import numpy as np
arr = np.random.randn(100, 100)
nbins = 10
binned = np.empty((arr.shape[0], nbins))
for i in range(arr.shape[0]):
    binned[i, :] = np.histogram(arr[i, :], bins=nbins)[0]
I feel like there should be a more direct and more efficient way to do that within numpy but I failed to find one.
Upvotes: 18
Views: 9123
Reputation: 215
For a large number of small data series, I think you can do a lot faster using something like numpy.digitize. Here is an example with 5000 data series, each featuring a modest 50 data points and targeting as few as 10 discrete bin locations. The speedup in this case is about an order of magnitude compared to the np.apply_along_axis implementation. The implementation looks like:
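def histograms(data, bin_edges):
    # np.digitize returns i where bin_edges[i-1] <= x < bin_edges[i]
    indices = np.digitize(data, bin_edges)
    # np.histogram closes the last bin on the right, so fold x == bin_edges[-1] into it
    indices[data == bin_edges[-1]] = len(bin_edges) - 1
    counts = np.zeros((data.shape[0], len(bin_edges) - 1))
    # Loop over bin numbers rather than np.unique(indices), so that empty
    # bins and out-of-range values cannot shift the columns
    for i in range(len(bin_edges) - 1):
        counts[:, i] = np.sum(indices == i + 1, axis=1)
    return counts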
And here are some timings and verification:
import time

data = np.random.rand(5000, 50)
bin_edges = np.linspace(0, 1, 11)

t1 = time.perf_counter()
for _ in range(10):  # repeat 10 times so that dividing by 10 gives a per-call average
    h1 = histograms(data, bin_edges)
t2 = time.perf_counter()
print('digitize ', 1000*(t2-t1)/10., 'ms')

t1 = time.perf_counter()
for _ in range(10):
    h2 = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_edges)[0], 1, data)
t2 = time.perf_counter()
print('numpy ', 1000*(t2-t1)/10., 'ms')

assert np.allclose(h1, h2)
The result is something like this:
digitize 1.690 ms
numpy 15.08 ms
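The remaining Python loop over bins can also be avoided. The following is only a sketch, not part of the timings above: it assumes all rows share the same bin edges, and the helper name histograms_bincount is made up for illustration. It looks up the bin of every value with np.searchsorted, offsets the bin numbers row by row, and counts everything in one np.bincount call.

```python
import numpy as np

def histograms_bincount(data, bin_edges):
    """Row-wise histogram counts of a 2D array using shared bin edges (illustrative helper)."""
    n_rows = data.shape[0]
    n_bins = len(bin_edges) - 1
    # side='right' yields i with bin_edges[i-1] <= x < bin_edges[i]; subtract 1 for a 0-based bin
    idx = np.searchsorted(bin_edges, data, side='right') - 1
    # np.histogram closes the last bin on the right
    idx[data == bin_edges[-1]] = n_bins - 1
    # Values outside [bin_edges[0], bin_edges[-1]] are dropped, as np.histogram does
    valid = (idx >= 0) & (idx < n_bins)
    # Shift each row's bin numbers into its own block of n_bins slots
    flat = (idx + n_bins * np.arange(n_rows)[:, None])[valid]
    counts = np.bincount(flat, minlength=n_rows * n_bins)
    return counts.reshape(n_rows, n_bins)

data = np.random.rand(5000, 50)
bin_edges = np.linspace(0, 1, 11)
counts = histograms_bincount(data, bin_edges)  # shape (5000, 10)
```

Whether this is faster than the per-bin loop depends on the array shape and the number of bins, so it is worth timing on your own data.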
Cheers.
Upvotes: 4
Reputation: 326
To bin a NumPy array along any axis (that is, to average groups of adjacent elements), you may use:
def bin_nd_data(arr, bin_n=2, axis=-1):
    """Bin an nD array along one axis by averaging groups of bin_n adjacent elements.

    axis must be non-negative or -1.
    """
    ss = list(arr.shape)
    if ss[axis] % bin_n == 0:
        ss[axis] = ss[axis] // bin_n
        # The default (C) order keeps adjacent elements together in the new axis;
        # order='F' would instead average elements that lie ss[axis] apart.
        if axis == -1:
            ss.append(bin_n)
            return np.mean(np.reshape(arr, ss), axis=-1)
        else:
            ss.insert(axis + 1, bin_n)
            return np.mean(np.reshape(arr, ss), axis=axis + 1)
    else:
        print('bin nd data, not divisible bin given : array shape :', arr.shape, ' bin ', bin_n)
        return None
It is a slight bother to take into account the case 'axis=-1'.
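As a small illustration of the reshape-and-average idea (a sketch on a made-up 2 x 6 array, not the function above), splitting the last axis into groups of 2 and averaging over the new trailing axis looks like this. The second call shows how the memory order passed to reshape changes which elements end up in the same group.

```python
import numpy as np

arr = np.arange(12).reshape(2, 6)   # rows: 0..5 and 6..11

# Default C order: each group holds 2 adjacent elements of a row
adjacent = arr.reshape(2, 3, 2).mean(axis=-1)
# [[ 0.5  2.5  4.5]
#  [ 6.5  8.5 10.5]]

# order='F': each group pairs elements that are 3 columns apart
strided = arr.reshape(2, 3, 2, order='F').mean(axis=-1)
# [[1.5 2.5 3.5]
#  [7.5 8.5 9.5]]
```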
Upvotes: 0
Reputation: 1382
I was a bit confused by the lambda in Ami's solution, so I expanded it out to show what it's doing:
def hist_1d(a):
    # x and nbins are as defined in Ami's solution
    return np.histogram(a, bins=nbins)[0]

counts = np.apply_along_axis(hist_1d, axis=1, arr=x)
Upvotes: 2
Reputation: 339
You have to use numpy.histogramdd, which is specifically meant for your problem.
Upvotes: -7
Reputation: 76297
You could use np.apply_along_axis:
>>> x = np.array([range(20), range(1, 21), range(2, 22)])
>>> nbins = 2
>>> np.apply_along_axis(lambda a: np.histogram(a, bins=nbins)[0], 1, x)
array([[10, 10],
       [10, 10],
       [10, 10]])
The main advantage (if any) is that it's slightly shorter, but I wouldn't expect much of a performance gain. It's possibly marginally more efficient in the assembly of the per-row results.
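As a quick check, here is a sketch that compares this call against the explicit loop from the question on the same random data (the seed and variable names are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal((100, 100))
nbins = 10

# Explicit loop, as in the question
binned_loop = np.empty((arr.shape[0], nbins))
for i in range(arr.shape[0]):
    binned_loop[i, :] = np.histogram(arr[i, :], bins=nbins)[0]

# Same computation through np.apply_along_axis
binned_apply = np.apply_along_axis(lambda a: np.histogram(a, bins=nbins)[0], 1, arr)

same = np.array_equal(binned_loop, binned_apply)
print(same)
```

Note that with an integer bins argument, np.histogram derives the bin edges from each row's own minimum and maximum, so the columns are only comparable across rows if you pass a common range or explicit bin edges.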
Upvotes: 18