cabralpinto

Reputation: 2098

Unexpected numpy sum behaviour with where parameter

As an example, have a look at these numpy arrays:

>>> a
array([[1, 2, 3], 
       [4, 5, 6]])
>>> b
array([[ True, False,  True],
       [False, False,  True],
       [ True,  True, False]])

Say I want, for each row of a, the sum of only those elements selected by each row of b. Here are two expressions that do just that:

>>> np.sum(a[:,None] * b[None], 2)
array([[ 4,  3,  3],
       [10,  6,  9]])
>>> np.sum(np.where(b[None], a[:,None], 0), 2)
array([[ 4,  3,  3],
       [10,  6,  9]])

I usually use the first option, but recently found out np.sum has a where parameter, and would expect this to work:

>>> np.sum(a[:,None], 2, where=b[None])
array([[10],
       [25]])

But the result is different. Each row here is actually the sum of the corresponding row of the correct result.

I also found that when dimensions already match without broadcasting, the results using both methods are the same:

>>> a
array([[1, 2, 3], 
       [4, 5, 6]])
>>> b
array([[ True, False,  True],
       [False, False,  True]])
>>> np.sum(a * b, 1)
array([4, 6])
>>> np.sum(a, 1, where=b)
array([4, 6])

What is the explanation for this behaviour? Is there a way to prevent it, or should I stick to my previous method?

Upvotes: 1

Views: 251

Answers (1)

hpaulj

Reputation: 231530

So what you have been doing is making a (2,3,3) array and summing on the last axis:

In [216]: np.where(b, a[:,None], 0)
Out[216]: 
array([[[1, 0, 3],
        [0, 0, 3],
        [1, 2, 0]],

       [[4, 0, 6],
        [0, 0, 6],
        [4, 5, 0]]])
In [217]: np.sum(_, axis=2)
Out[217]: 
array([[ 4,  3,  3],
       [10,  6,  9]])
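
Since each output element is the sum over k of a[i,k]*b[j,k], the same numbers can also be obtained as a matrix product with b transposed, without building the (2,3,3) intermediate. A minimal sketch, assuming a and b as defined in the question; this is a swapped-in alternative, not the where mechanism itself, and the names via_matmul and via_einsum are just for illustration:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([[True, False, True],
              [False, False, True],
              [True, True, False]])

# out[i, j] = sum_k a[i, k] * b[j, k]; cast the mask so the product stays integer
via_matmul = a @ b.T.astype(a.dtype)
via_einsum = np.einsum('ik,jk->ij', a, b.astype(a.dtype))

print(via_matmul)
# [[ 4  3  3]
#  [10  6  9]]
```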

If we replicate a to make a (2,3,3) array:

In [218]: A=a[:,None,:].repeat(3,1)
In [219]: A
Out[219]: 
array([[[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]],

       [[4, 5, 6],
        [4, 5, 6],
        [4, 5, 6]]])

We can sum where b is true (b broadcasts from (3,3) to (1,3,3) and then to (2,3,3)):

In [221]: np.sum(A, where=b, axis=2)
Out[221]: 
array([[ 4,  3,  3],
       [10,  6,  9]])
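
The repeat call above copies the data. As a hedged variation, np.broadcast_to should give a read-only (2,3,3) view of a with the same shape, which np.sum can then reduce with where=b. This is a sketch under that assumption (A_view and result are illustrative names), and I have not timed it against the other approaches:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([[True, False, True],
              [False, False, True],
              [True, True, False]])

# Read-only view with the full (2,3,3) shape; no data is copied
A_view = np.broadcast_to(a[:, None, :], (2, 3, 3))

# b is broadcast against the (2,3,3) operand, as in the repeat version
result = np.sum(A_view, where=b, axis=2)
print(result)
# [[ 4  3  3]
#  [10  6  9]]
```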

This use of where is relatively new, and it took a bit of trial and error to figure out how to do it. I don't know if it has any speed advantages.

The where parameter is simplest to use with a ufunc that takes 1 or 2 arguments, such as divide or an inverse, when we don't want it to calculate at the 0s. In that case we specify an out array with default values, because the positions skipped by where are left untouched. np.sum is np.add.reduce, which has a default start of 0, so a pre-filled out array isn't needed.

where : array_like of bool, optional
    A boolean array which is broadcasted to match the dimensions
    of `array`, and selects elements to include in the reduction. Note
    that for ufuncs like ``minimum`` that do not have an identity
    defined, one has to pass in also ``initial``.
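
To make the two cases concrete, here is a small sketch with made-up arrays (num, den, x and mask are illustrative, not from the question). The first part uses where with an elementwise ufunc plus a pre-filled out array; the second uses where in a reduction that has no identity, so initial has to be supplied:

```python
import numpy as np

# Elementwise ufunc: skip the division where the denominator is 0.
# Skipped positions keep whatever was already in `out`.
num = np.array([1.0, 2.0, 3.0])
den = np.array([2.0, 0.0, 4.0])
quotient = np.divide(num, den, out=np.zeros_like(num), where=den != 0)
print(quotient)   # [0.5  0.   0.75]

# Reduction without an identity: minimum needs `initial` when `where` is used.
x = np.array([[1, 2, 3],
              [4, 5, 6]])
mask = np.array([[True, False, True],
                 [False, False, True]])
row_min = np.min(x, axis=1, where=mask, initial=99)
print(row_min)    # [1 6]
```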

While b broadcasts to match A, the array being summed is not broadcast to match where, and thus has to be replicated. Passing a[:,None] directly raises an error:

In [231]: np.sum(a[:,None], where=b, axis=2)
Traceback (most recent call last):
  File "<ipython-input-231-b6e9e9179fac>", line 1, in <module>
    np.sum(a[:,None], where=b, axis=2)
  File "<__array_function__ internals>", line 5, in sum
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 2247, in sum
    return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: non-broadcastable operand with shape (2,1,3) doesn't match the broadcast shape (2,3,3)

Upvotes: 1
