The Dude

Reputation: 4005

Determine sum of numpy array while excluding certain values

I would like to compute the sum of a two-dimensional numpy array while excluding elements that have a certain value. What is the most efficient way to do this?

For example, here I initialize a two dimensional numpy array of 1s and replace several of them by 2:

import numpy

data_set = numpy.ones((10, 10))

data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2

How can I sum over the elements in my two dimensional array while excluding all of the 2s? Note that with the 10 by 10 array the correct answer should be 97 as I replaced three elements with the value 2.

I know I can do this with nested for loops. For example:

elements = []
for idx_x in range(data_set.shape[0]):
  for idx_y in range(data_set.shape[1]):
    if data_set[idx_x][idx_y] != 2:
      elements.append(data_set[idx_x][idx_y])

data_set_sum = numpy.sum(elements)

However, on my actual data (which is very large), this is too slow. What is the correct way of doing this?

Upvotes: 4

Views: 18462

Answers (4)

Johnus

Reputation: 720

Using np.sum's where= argument avoids the array copy that advanced (boolean) indexing would otherwise trigger:

>>> import numpy as np
>>> data_set = np.ones((10,10))
>>> data_set[(4,5,6),(4,5,6)] = 2
>>> np.sum(data_set, where=data_set != 2)
97.0
>>> data_set.sum(where=data_set != 2)
97.0

https://numpy.org/doc/stable/reference/generated/numpy.sum.html

Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).

https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing
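Because where= leaves the array's shape intact, it can also be combined with axis= to get per-row or per-column sums that skip the excluded value. This is only a minimal sketch under the same setup as above; the names mask, row_sums and col_sums are mine, and it assumes a NumPy version recent enough to support where= on np.sum (1.17 or later).

```python
import numpy as np

data_set = np.ones((10, 10))
data_set[(4, 5, 6), (4, 5, 6)] = 2

# Boolean mask: True for every element that should take part in the sum
mask = data_set != 2

# Masked-out elements are skipped, so no filtered copy of the data is built
row_sums = np.sum(data_set, axis=1, where=mask)
col_sums = np.sum(data_set, axis=0, where=mask)

# Summing the per-row results gives the same overall total as before
total = row_sums.sum()
print(row_sums)
print(total)
```

Rows 4, 5 and 6 each lose one element, so their sums come out one lower than the other rows.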

Upvotes: 1

Jay M

Reputation: 4297

How about this approach, which makes use of numpy's boolean indexing?

We simply set all the values that meet the condition to zero before taking the sum. That way we don't change the shape of the array, as we would if we filtered those values out of it.

The other benefit is that we can still sum along an axis after the filter is applied.

import numpy

data_set = numpy.ones((10, 10))

data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2

print("Sum", data_set.sum())

another_set = numpy.array(data_set)  # Take a copy; we'll need it later

data_set[data_set == 2] = 0  # Set all the values that are 2 to zero
print("Filtered sum", data_set.sum())
print("Along axis", data_set.sum(0), data_set.sum(1))

Equally, we could use any other boolean expression to select the data we wish to exclude from the sum.

another_set[(another_set > 1) & (another_set < 3)] = 0
print("Another filtered sum", another_set.sum())
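Note that the assignments above overwrite the array in place. If the original values are still needed, one possible variation (not part of the answer as written) is numpy.where, which builds a zeroed copy and leaves data_set untouched. The name zeroed is just for illustration.

```python
import numpy as np

data_set = np.ones((10, 10))
data_set[(4, 5, 6), (4, 5, 6)] = 2

# New array: 0 where the value is 2, the original value everywhere else
zeroed = np.where(data_set == 2, 0, data_set)

filtered_sum = zeroed.sum()
column_sums = zeroed.sum(axis=0)

print("Original sum", data_set.sum())  # data_set itself is unchanged
print("Filtered sum", filtered_sum)
print("Along axis 0", column_sums)
```

This trades one extra array allocation for not having to keep a manual copy around.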

Upvotes: 0

tktk

Reputation: 11734

Use numpy's capability of indexing with boolean arrays. In the example below, data_set != 2 evaluates to a boolean array of the same shape as data_set, which is True wherever the element is not 2. So data_set[data_set != 2] is a fast and convenient way to get a (flattened, one-dimensional) array which doesn't contain a certain value. Of course, the boolean expression can be more complex.

In [1]: import numpy as np
In [2]: data_set = np.ones((10, 10))
In [4]: data_set[4,4] = 2
In [5]: data_set[5,5] = 2
In [6]: data_set[6,6] = 2
In [7]: data_set[data_set != 2].sum()
Out[7]: 97.0
In [8]: data_set != 2
Out[8]: 
array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       ...
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True]], dtype=bool)
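As one example of a more complex expression, here is a rough sketch that excludes several values at once using np.isin. The second excluded value (3) and the names excluded and keep are made up for illustration; also keep in mind that this is an exact comparison, so it only suits floats that are stored exactly.

```python
import numpy as np

data_set = np.ones((10, 10))
data_set[4, 4] = 2
data_set[5, 5] = 3

# Hypothetical list of values to leave out of the sum
excluded = [2, 3]

# np.isin marks elements that match any excluded value; ~ inverts the mask
keep = ~np.isin(data_set, excluded)

total = data_set[keep].sum()
print(total)
```

Masks can also be combined with & and | (with parentheses around each comparison), for instance (data_set != 2) & (data_set != 3).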

Upvotes: 12

njzk2

Reputation: 39406

Without numpy, the solution is not much more complex:

x = [1,2,3,4,5,6,7]
sum(y for y in x if y != 7)
# 21

Works for a list of excluded values too:

# set is faster for resolving `in`
exl = set([1,2,3])
sum(y for y in x if y not in exl)
# 22
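The same idea carries over to a two-dimensional list of lists by adding a second loop to the generator. This is only a sketch with a small made-up grid; for large data it will be far slower than the numpy approaches.

```python
# Small nested list standing in for a 2-D data set (illustrative values)
rows = [
    [1, 1, 2],
    [1, 2, 1],
    [2, 1, 1],
]

# Flatten with a nested generator and skip the excluded value
total = sum(v for row in rows for v in row if v != 2)
print(total)

# A set of excluded values works the same way
exl = {2, 3}
total_excluding_set = sum(v for row in rows for v in row if v not in exl)
print(total_excluding_set)
```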

Upvotes: 5
