user3556757

Reputation: 3609

vectorizing numpy bincount

I have a 2D numpy array, A. I want to apply np.bincount() to each column of A to generate another 2D array, B, that is composed of the bincounts of each column of A.

My problem is that np.bincount() is a function that takes a 1D array-like. It is not an array method that accepts an axis argument, as in B = A.max(axis=1).

Is there a more pythonic/numpythic way to generate this B array other than a nasty for-loop?

import numpy as np

states = 4
rows = 8
cols = 4

A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))

for x in range(A.shape[1]):
    B[:,x] =  np.bincount(A[:,x])

Upvotes: 8

Views: 2467

Answers (4)

Divakar

Reputation: 221524

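Using the same philosophy as in this post, here's a vectorized approach. Each column is offset by a multiple of the number of bins, so the columns occupy disjoint value ranges and a single np.bincount call over the flattened array counts all of them at once: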

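m = A.shape[1]   # number of columns
n = A.max()+1    # number of bins (A must hold non-negative integers)
# shift column j by j*n so each column gets its own block of n bins
A1 = A + (n*np.arange(m))
# count everything in one pass, then reshape back to (n, m)
out = np.bincount(A1.ravel(),minlength=n*m).reshape(m,-1).T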
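A self-contained sketch of the offset trick above, checked against the per-column loop. The helper name bincount_columns_offset, its n_bins parameter, and the fixed seed are illustrative choices, not part of the original answer. It assumes A holds non-negative integers and that n_bins, when given, is larger than A.max().

```python
import numpy as np


def bincount_columns_offset(A, n_bins=None):
    """Column-wise bincount using one flat np.bincount call (illustrative helper)."""
    A = np.asarray(A)
    m = A.shape[1]
    # Assumption: n_bins, if supplied, is strictly greater than A.max().
    n = int(A.max()) + 1 if n_bins is None else n_bins
    # Shift column j by j*n so its values fall into their own block of n bins.
    shifted = A + n * np.arange(m)
    flat = np.bincount(shifted.ravel(), minlength=n * m)
    # Row j of the reshaped array holds the counts for column j; transpose to (n, m).
    return flat.reshape(m, n).T


rng = np.random.default_rng(0)
A = rng.integers(0, 4, size=(8, 4))
B = bincount_columns_offset(A, n_bins=4)

# Reference result from the plain per-column loop.
expected = np.stack(
    [np.bincount(A[:, j], minlength=4) for j in range(A.shape[1])], axis=1
)
print(np.array_equal(B, expected))
```

Passing n_bins explicitly keeps the number of rows in the result fixed even when the largest state happens not to occur in A.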

Upvotes: 6

Warren Weckesser

Reputation: 114811

Yet another possibility:

import numpy as np


def bincount_columns(x, minlength=None):
    nbins = x.max() + 1
    if minlength is not None:
        nbins = max(nbins, minlength)
    ncols = x.shape[1]
    count = np.zeros((nbins, ncols), dtype=int)
    colidx = np.arange(ncols)[None, :]
    np.add.at(count, (x, colidx), 1)
    return count

For example,

In [110]: x
Out[110]: 
array([[4, 2, 2, 3],
       [4, 3, 4, 4],
       [4, 3, 4, 4],
       [0, 2, 4, 0],
       [4, 1, 2, 1],
       [4, 2, 4, 3]])

In [111]: bincount_columns(x)
Out[111]: 
array([[1, 0, 0, 1],
       [0, 1, 0, 1],
       [0, 3, 2, 0],
       [0, 2, 0, 2],
       [5, 0, 4, 2]])

In [112]: bincount_columns(x, minlength=7)
Out[112]: 
array([[1, 0, 0, 1],
       [0, 1, 0, 1],
       [0, 3, 2, 0],
       [0, 2, 0, 2],
       [5, 0, 4, 2],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

Upvotes: 1

Eelco Hoogendoorn

Reputation: 10759

This solution, which uses the numpy_indexed package (disclaimer: I am its author), is fully vectorized, so it involves no Python loops behind the scenes. Also, there are no restrictions on the input; not every column needs to contain the same set of unique values.

import numpy as np
import numpy_indexed as npi
rowidx, colidx = np.indices(A.shape)
(bin, col), B = npi.count_table(A.flatten(), colidx.flatten())

This gives an alternative (sparse) representation of the same result, which may be much more appropriate if the B array does indeed contain many zeros:

(bin, col), count = npi.count((A.flatten(), colidx.flatten()))

Note that apply_along_axis is just syntactic sugar for a for-loop, and has the same performance characteristics.
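For readers without numpy_indexed installed, the sparse (value, column, count) representation mentioned above can be approximated in plain NumPy. This is a sketch using np.unique with axis=0, not how numpy_indexed works internally, and the small array A is made up for illustration.

```python
import numpy as np

A = np.array([[0, 1],
              [0, 3],
              [2, 3]])

# Pair every value with the index of the column it sits in.
col_idx = np.broadcast_to(np.arange(A.shape[1]), A.shape)
pairs = np.column_stack([A.ravel(), col_idx.ravel()])

# Each distinct (value, column) pair and how often it occurs.
unique_pairs, counts = np.unique(pairs, axis=0, return_counts=True)

# Scatter the sparse counts into a dense (n_bins, n_cols) table for comparison.
dense = np.zeros((A.max() + 1, A.shape[1]), dtype=int)
dense[unique_pairs[:, 0], unique_pairs[:, 1]] = counts
print(dense)
```

Only the pairs that actually occur are stored in unique_pairs and counts, which is what makes this form attractive when the dense table would be mostly zeros.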

Upvotes: 2

jotasi

Reputation: 5177

I would suggest using np.apply_along_axis, which lets you apply a 1D function (in this case np.bincount) to 1D slices of a higher-dimensional array:

import numpy as np

states = 4
rows = 8
cols = 4

A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))

B = np.apply_along_axis(np.bincount, axis=0, arr=A)

You'll have to be careful, though. This (as well as your suggested for-loop) only works if the output of np.bincount has the right shape. If the maximum state is absent from one or more columns of your array A, the output for those columns will be shorter than expected and the code will fail with a ValueError.
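One way to sidestep the shape problem described above is to pass minlength through to np.bincount; np.apply_along_axis forwards extra keyword arguments to the function it applies. A minimal sketch, with a small hand-made A in which the second column lacks the maximum state:

```python
import numpy as np

states = 4
# Column 1 never contains the maximum state (3), so its plain bincount has length 3.
A = np.array([[0, 1],
              [3, 2],
              [3, 0]])

# minlength is forwarded to np.bincount, so every column yields `states` counts.
B = np.apply_along_axis(np.bincount, axis=0, arr=A, minlength=states)
print(B)

# Without minlength the per-column results differ in length and the call raises.
try:
    np.apply_along_axis(np.bincount, axis=0, arr=A)
    failed = False
except ValueError:
    failed = True
print(failed)
```

The same minlength=states argument also fixes the explicit for-loop, since each np.bincount result then matches the column length of the preallocated B.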

Upvotes: 2
