Reputation: 417
I am seeing behaviour with numpy bincount that I cannot make sense of. I want to bin the values in a 2D array in a row-wise manner and see the behaviour below. Why would it work with dbArray but fail with simarray?
>>> dbArray
array([[1, 0, 1, 0, 1],
[1, 1, 1, 1, 1],
[1, 1, 0, 1, 1],
[1, 0, 0, 0, 0],
[0, 0, 0, 1, 1],
[0, 1, 0, 1, 0]])
>>> N.apply_along_axis(N.bincount,1,dbArray)
array([[2, 3],
[0, 5],
[1, 4],
[4, 1],
[3, 2],
[3, 2]], dtype=int64)
>>> simarray
array([[2, 0, 2, 0, 2],
[2, 1, 2, 1, 2],
[2, 1, 1, 1, 2],
[2, 0, 1, 0, 1],
[1, 0, 1, 1, 2],
[1, 1, 1, 1, 1]])
>>> N.apply_along_axis(N.bincount,1,simarray)
Traceback (most recent call last):
File "<pyshell#31>", line 1, in <module>
N.apply_along_axis(N.bincount,1,simarray)
File "C:\Python27\lib\site-packages\numpy\lib\shape_base.py", line 118, in apply_along_axis
outarr[tuple(i.tolist())] = res
ValueError: could not broadcast input array from shape (2) into shape (3)
Upvotes: 26
Views: 21897
Reputation: 109
This is a function that does exactly what you want:
Without any loops.
Generalized to any dimention.
import numpy as np
def sub_sum_partition(a, partition):
"""
Generalization of np.bincount(partition, a).
Sums rows of a matrix for each value of array of non-negative ints.
:param a: array_like
:param partition: array_like, 1 dimension, nonnegative ints
:return: matrix of shape ('one larger than the largest value in partition', a.shape[1:]). The i's element is
the sum of rows j in 'a' s.t. partition[j] == i
Example:
>>> mat = np.array([[1, 0, 1, 0, 1],
... [1, 1, 1, 1, 1],
... [1, 1, 0, 1, 1],
... [1, 0, 0, 0, 0],
... [0, 0, 0, 1, 1],
... [0, 1, 0, 1, 0]])
>>> par = np.array([2, 2, 3, 3, 2, 2])
>>> ret = sub_sum_partition(mat, par)
>>> ret
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[2., 2., 2., 3., 3.],
[2., 1., 0., 1., 1.]])
>>> i = 0; assert np.all(mat[par == i].sum(axis=0) == ret[i])
>>> i = 1; assert np.all(mat[par == i].sum(axis=0) == ret[i])
>>> i = 2; assert np.all(mat[par == i].sum(axis=0) == ret[i])
>>> i = 3; assert np.all(mat[par == i].sum(axis=0) == ret[i])
3D Example:
>>> mat_3d = np.array([[[1, 2], [3, 4]],
... [[5, 6], [7, 8]],
... [[9, 10], [11, 12]],
... [[13, 14], [15, 16]]])
>>> par_3d = np.array([0, 1, 0, 1])
>>> ret_3d = sub_sum_partition(mat_3d, par_3d)
>>> ret_3d
array([[[10., 12.],
[14., 16.]],
<BLANKLINE>
[[18., 20.],
[22., 24.]]])
>>> i = 0; assert np.all(mat_3d[par_3d == i].sum(axis=0) == ret_3d[i])
>>> i = 1; assert np.all(mat_3d[par_3d == i].sum(axis=0) == ret_3d[i])
"""
assert partition.shape == (len(a),)
n = np.prod(a.shape[1:], dtype=int)
bins = ((np.tile(partition, (n, 1)) * n).T + np.arange(n, dtype=int)).reshape(-1)
sums = np.bincount(bins, a.reshape(-1))
if n > 1:
sums = sums.reshape(-1, *a.shape[1:])
return sums
Upvotes: 3
Reputation: 18141
Here's an implementation of bincount
that works on ND arrays, performing the counts along the last axis of the array. (Inspired by the 3D version posted here by @winwin.)
Some notes:
maxbin
, in which case the high values are discarded.def bincount_last_axis(a, maxbin=None):
"""
Like np.bincount, but works for ND arrays.
The bincount is performed along the last axis of the input.
Args:
a:
ndarray, with unsigned integer contents
maxbin:
The highest bin value to return.
Determines the length of the last axis in the results.
If the input array contains higher values than this,
those values will not be counted in the results.
Returns:
ndarray
The dtype of the output will be the minimum unsigned type that is
guaranteed to accommodate the counts, based on the length of the input.
If the input shape is:
(S0, S1, S2, ..., S(N-1), SN),
then the result shape will be:
(S0, S1, S2, ..., S(N-1), maxbin+1)
Author: https://github.com/stuarteberg
"""
a = np.asarray(a)
assert maxbin is None or maxbin >= -1
assert np.issubdtype(a.dtype, np.integer)
if np.issubdtype(a.dtype, np.signedinteger):
assert a.min() >= 0, "Can't operate on negative values"
maxval = a.max()
if maxbin is None:
num_bins = maxval + 1
elif maxval <= maxbin:
num_bins = maxbin + 1
else:
# Leave one extra bin for all values above maxbin,
# and force all high input values into that bin,
# which we will discard before returning.
num_bins = maxbin + 2
a = np.clip(a, 0, maxbin + 1)
# Flat counts
counts_dtype = np.min_scalar_type(num_bins-1)
counts = np.zeros((a.size // a.shape[-1]) * num_bins, dtype=counts_dtype)
# Calculate flat indexes into the 'counts' array
index_dtype = np.min_scalar_type(counts.size)
i = np.arange(a.size, dtype=index_dtype) // a.shape[-1]
i *= num_bins
i += a.reshape(-1).astype(index_dtype)
# Perform the bincounts
np.add.at(counts, i, 1)
# Reshape back to ND
counts = counts.reshape((*a.shape[:-1], num_bins))
if maxbin is None or maxval <= maxbin:
return counts
# Discard the extra bin we used for above-max values.
return counts[..., :-1]
Example usage:
In [144]: a = np.random.randint(3, size=(2,3,4))
...: print(a)
[[[1 2 1 0]
[0 1 2 0]
[0 1 2 2]]
[[1 2 1 1]
[0 1 2 0]
[0 1 2 1]]]
In [145]: bincount_last_axis(a)
Out[145]:
array([[[1, 2, 1],
[2, 1, 1],
[1, 1, 2]],
[[0, 3, 1],
[2, 1, 1],
[1, 2, 1]]], dtype=uint8)
In [146]: bincount_last_axis(a, 10)
Out[146]:
array([[[1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0]],
[[0, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0]]], dtype=uint8)
In [147]: bincount_last_axis(a, 1)
Out[147]:
array([[[1, 2],
[2, 1],
[1, 1]],
[[0, 3],
[2, 1],
[1, 2]]], dtype=uint8)
Upvotes: 1
Reputation: 1797
As @DSM has already mentioned, bincount of a 2d array cannot be done without knowing the maximum value of the array, because it would mean an inconsistency of array sizes.
But thanks to the power of numpy's indexing, it was fairly easy to make a faster implementation of 2d bincount, as it doesn't use concatenation or anything.
def bincount2d(arr, bins=None):
if bins is None:
bins = np.max(arr) + 1
count = np.zeros(shape=[len(arr), bins], dtype=np.int64)
indexing = (np.ones_like(arr).T * np.arange(len(arr))).T
np.add.at(count, (indexing, arr), 1)
return count
UPD: Here is the 3D version of it. Just dug it up from some old code of mine:
def bincount3d(arr, bins=None):
if bins is None:
bins = np.max(arr) + 1
count = np.zeros(shape=[arr.shape[0], arr.shape[1], bins], dtype=np.int64)
index2d = np.ones_like(arr) * np.reshape(np.arange(arr.shape[1]), newshape=[1, arr.shape[1], 1])
index3d = np.ones_like(arr) * np.reshape(np.arange(arr.shape[0]), newshape=[arr.shape[0], 1, 1])
np.add.at(count, (index3d, index2d, arr), 1)
return count
Upvotes: 6
Reputation: 353019
The problem is that bincount
isn't always returning the same shaped objects, in particular when values are missing. For example:
>>> m = np.array([[0,0,1],[1,1,0],[1,1,1]])
>>> np.apply_along_axis(np.bincount, 1, m)
array([[2, 1],
[1, 2],
[0, 3]])
>>> [np.bincount(m[i]) for i in range(m.shape[1])]
[array([2, 1]), array([1, 2]), array([0, 3])]
works, but:
>>> m = np.array([[0,0,0],[1,1,0],[1,1,0]])
>>> m
array([[0, 0, 0],
[1, 1, 0],
[1, 1, 0]])
>>> [np.bincount(m[i]) for i in range(m.shape[1])]
[array([3]), array([1, 2]), array([1, 2])]
>>> np.apply_along_axis(np.bincount, 1, m)
Traceback (most recent call last):
File "<ipython-input-49-72e06e26a718>", line 1, in <module>
np.apply_along_axis(np.bincount, 1, m)
File "/usr/local/lib/python2.7/dist-packages/numpy/lib/shape_base.py", line 117, in apply_along_axis
outarr[tuple(i.tolist())] = res
ValueError: could not broadcast input array from shape (2) into shape (1)
won't.
You could use the minlength
parameter and pass it using a lambda
or partial
or something:
>>> np.apply_along_axis(lambda x: np.bincount(x, minlength=2), axis=1, arr=m)
array([[3, 0],
[1, 2],
[1, 2]])
Upvotes: 26