James
James

Reputation: 417

Can numpy bincount work with 2D arrays?

I am seeing behaviour with numpy bincount that I cannot make sense of. I want to bin the values in a 2D array in a row-wise manner and see the behaviour below. Why would it work with dbArray but fail with simarray?

>>> dbArray
array([[1, 0, 1, 0, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 1, 0, 1, 0]])
>>> N.apply_along_axis(N.bincount,1,dbArray)
array([[2, 3],
       [0, 5],
       [1, 4],
       [4, 1],
       [3, 2],
       [3, 2]], dtype=int64)
>>> simarray
array([[2, 0, 2, 0, 2],
       [2, 1, 2, 1, 2],
       [2, 1, 1, 1, 2],
       [2, 0, 1, 0, 1],
       [1, 0, 1, 1, 2],
       [1, 1, 1, 1, 1]])
>>> N.apply_along_axis(N.bincount,1,simarray)

Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    N.apply_along_axis(N.bincount,1,simarray)
  File "C:\Python27\lib\site-packages\numpy\lib\shape_base.py", line 118, in apply_along_axis
    outarr[tuple(i.tolist())] = res
ValueError: could not broadcast input array from shape (2) into shape (3)

Upvotes: 26

Views: 21897

Answers (4)

Evyatar Cohen
Evyatar Cohen

Reputation: 109

This is a function that does exactly what you want:

  1. Without any loops.

  2. Generalized to any dimention.

    import numpy as np
    
    def sub_sum_partition(a, partition):
        """
        Generalization of np.bincount(partition, a).
        Sums rows of a matrix for each value of array of non-negative ints.
    
        :param a: array_like
        :param partition: array_like, 1 dimension, nonnegative ints
        :return: matrix of shape ('one larger than the largest value in partition', a.shape[1:]). The i's element is
        the sum of rows j in 'a' s.t. partition[j] == i
    
        Example:
        >>> mat = np.array([[1, 0, 1, 0, 1],
        ...                 [1, 1, 1, 1, 1],
        ...                 [1, 1, 0, 1, 1],
        ...                 [1, 0, 0, 0, 0],
        ...                 [0, 0, 0, 1, 1],
        ...                 [0, 1, 0, 1, 0]])
        >>> par = np.array([2, 2, 3, 3, 2, 2])
        >>> ret = sub_sum_partition(mat, par)
        >>> ret
        array([[0., 0., 0., 0., 0.],
               [0., 0., 0., 0., 0.],
               [2., 2., 2., 3., 3.],
               [2., 1., 0., 1., 1.]])
        >>> i = 0; assert np.all(mat[par == i].sum(axis=0) == ret[i])
        >>> i = 1; assert np.all(mat[par == i].sum(axis=0) == ret[i])
        >>> i = 2; assert np.all(mat[par == i].sum(axis=0) == ret[i])
        >>> i = 3; assert np.all(mat[par == i].sum(axis=0) == ret[i])
    
        3D Example:
        >>> mat_3d = np.array([[[1, 2], [3, 4]],
        ...                    [[5, 6], [7, 8]],
        ...                    [[9, 10], [11, 12]],
        ...                    [[13, 14], [15, 16]]])
        >>> par_3d = np.array([0, 1, 0, 1])
        >>> ret_3d = sub_sum_partition(mat_3d, par_3d)
        >>> ret_3d
        array([[[10., 12.],
                [14., 16.]],
        <BLANKLINE>
               [[18., 20.],
                [22., 24.]]])
        >>> i = 0; assert np.all(mat_3d[par_3d == i].sum(axis=0) == ret_3d[i])
        >>> i = 1; assert np.all(mat_3d[par_3d == i].sum(axis=0) == ret_3d[i])
        """
        assert partition.shape == (len(a),)
        n = np.prod(a.shape[1:], dtype=int)
        bins = ((np.tile(partition, (n, 1)) * n).T + np.arange(n, dtype=int)).reshape(-1)
        sums = np.bincount(bins, a.reshape(-1))
        if n > 1:
            sums = sums.reshape(-1, *a.shape[1:])
        return sums
    

Upvotes: 3

Stuart Berg
Stuart Berg

Reputation: 18141

Here's an implementation of bincount that works on ND arrays, performing the counts along the last axis of the array. (Inspired by the 3D version posted here by @winwin.)

Some notes:

  • This implementation flattens the input the array, to avoid allocating N index arrays for N-dimensional input
  • Strives to use minimal-width dtypes to save RAM.
  • Permits inputs with values above maxbin, in which case the high values are discarded.
def bincount_last_axis(a, maxbin=None):
    """
    Like np.bincount, but works for ND arrays.
    The bincount is performed along the last axis of the input.

    Args:
        a:
            ndarray, with unsigned integer contents
        maxbin:
            The highest bin value to return.
            Determines the length of the last axis in the results.
            If the input array contains higher values than this,
            those values will not be counted in the results.
    Returns:
        ndarray

        The dtype of the output will be the minimum unsigned type that is
        guaranteed to accommodate the counts, based on the length of the input.

        If the input shape is:

            (S0, S1, S2, ..., S(N-1), SN),

        then the result shape will be:

            (S0, S1, S2, ..., S(N-1), maxbin+1)

    Author: https://github.com/stuarteberg
    """
    a = np.asarray(a)
    assert maxbin is None or maxbin >= -1
    assert np.issubdtype(a.dtype, np.integer)
    if np.issubdtype(a.dtype, np.signedinteger):
        assert a.min() >= 0, "Can't operate on negative values"

    maxval = a.max()
    if maxbin is None:
        num_bins = maxval + 1
    elif maxval <= maxbin:
        num_bins = maxbin + 1
    else:
        # Leave one extra bin for all values above maxbin,
        # and force all high input values into that bin,
        # which we will discard before returning.
        num_bins = maxbin + 2
        a = np.clip(a, 0, maxbin + 1)

    # Flat counts
    counts_dtype = np.min_scalar_type(num_bins-1)
    counts = np.zeros((a.size // a.shape[-1]) * num_bins, dtype=counts_dtype)

    # Calculate flat indexes into the 'counts' array
    index_dtype = np.min_scalar_type(counts.size)
    i = np.arange(a.size, dtype=index_dtype) // a.shape[-1]
    i *= num_bins
    i += a.reshape(-1).astype(index_dtype)

    # Perform the bincounts
    np.add.at(counts, i, 1)

    # Reshape back to ND
    counts = counts.reshape((*a.shape[:-1], num_bins))

    if maxbin is None or maxval <= maxbin:
        return counts

    # Discard the extra bin we used for above-max values.
    return counts[..., :-1]

Example usage:

In [144]: a = np.random.randint(3, size=(2,3,4))
     ...: print(a)
[[[1 2 1 0]
  [0 1 2 0]
  [0 1 2 2]]

 [[1 2 1 1]
  [0 1 2 0]
  [0 1 2 1]]]

In [145]: bincount_last_axis(a)
Out[145]:
array([[[1, 2, 1],
        [2, 1, 1],
        [1, 1, 2]],

       [[0, 3, 1],
        [2, 1, 1],
        [1, 2, 1]]], dtype=uint8)

In [146]: bincount_last_axis(a, 10)
Out[146]:
array([[[1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0]],

       [[0, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0]]], dtype=uint8)

In [147]: bincount_last_axis(a, 1)
Out[147]:
array([[[1, 2],
        [2, 1],
        [1, 1]],

       [[0, 3],
        [2, 1],
        [1, 2]]], dtype=uint8)

Upvotes: 1

winwin
winwin

Reputation: 1797

As @DSM has already mentioned, bincount of a 2d array cannot be done without knowing the maximum value of the array, because it would mean an inconsistency of array sizes.

But thanks to the power of numpy's indexing, it was fairly easy to make a faster implementation of 2d bincount, as it doesn't use concatenation or anything.

def bincount2d(arr, bins=None):
    if bins is None:
        bins = np.max(arr) + 1
    count = np.zeros(shape=[len(arr), bins], dtype=np.int64)
    indexing = (np.ones_like(arr).T * np.arange(len(arr))).T
    np.add.at(count, (indexing, arr), 1)

    return count

UPD: Here is the 3D version of it. Just dug it up from some old code of mine:

def bincount3d(arr, bins=None):
    if bins is None:
        bins = np.max(arr) + 1
    count = np.zeros(shape=[arr.shape[0], arr.shape[1], bins], dtype=np.int64)
    index2d = np.ones_like(arr) * np.reshape(np.arange(arr.shape[1]), newshape=[1, arr.shape[1], 1])
    index3d = np.ones_like(arr) * np.reshape(np.arange(arr.shape[0]), newshape=[arr.shape[0], 1, 1])
    np.add.at(count, (index3d, index2d, arr), 1)

    return count

Upvotes: 6

DSM
DSM

Reputation: 353019

The problem is that bincount isn't always returning the same shaped objects, in particular when values are missing. For example:

>>> m = np.array([[0,0,1],[1,1,0],[1,1,1]])
>>> np.apply_along_axis(np.bincount, 1, m)
array([[2, 1],
       [1, 2],
       [0, 3]])
>>> [np.bincount(m[i]) for i in range(m.shape[1])]
[array([2, 1]), array([1, 2]), array([0, 3])]

works, but:

>>> m = np.array([[0,0,0],[1,1,0],[1,1,0]])
>>> m
array([[0, 0, 0],
       [1, 1, 0],
       [1, 1, 0]])
>>> [np.bincount(m[i]) for i in range(m.shape[1])]
[array([3]), array([1, 2]), array([1, 2])]
>>> np.apply_along_axis(np.bincount, 1, m)
Traceback (most recent call last):
  File "<ipython-input-49-72e06e26a718>", line 1, in <module>
    np.apply_along_axis(np.bincount, 1, m)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/shape_base.py", line 117, in apply_along_axis
    outarr[tuple(i.tolist())] = res
ValueError: could not broadcast input array from shape (2) into shape (1)

won't.

You could use the minlength parameter and pass it using a lambda or partial or something:

>>> np.apply_along_axis(lambda x: np.bincount(x, minlength=2), axis=1, arr=m)
array([[3, 0],
       [1, 2],
       [1, 2]])

Upvotes: 26

Related Questions