Ling Guo

Reputation: 592

Selecting rows in numpy array

I have a NumPy array (mat) of shape (n, 4): four columns and a large number (n) of rows. The first three columns hold the x, y, and z values in my calculation. I want to select the rows where x is below a given number (min_x) or above a given number (max_x), and y is below min_y or above max_y, and z is below min_z or above max_z.

This is how I currently implement it:

import numpy as np

mark = np.where(((mat[:, 0] <= min_x) | (mat[:, 0] > max_x)) &
                ((mat[:, 1] <= min_y) | (mat[:, 1] > max_y)) &
                ((mat[:, 2] <= min_z) | (mat[:, 2] > max_z)))

# mark[0] holds the matching row indices, so index along axis 0 (rows)
mat_new = mat[mark[0]]

Is the technique that I am using correct, and the best way to achieve the desired functionality? I will greatly appreciate any help. Thanks.

Upvotes: 3

Views: 1946

Answers (3)

Brad Solomon

Reputation: 40878

What you have now looks fine. But since you are asking about other ways to achieve the desired functionality: you can create a 1-dimensional boolean mask that is either True or False for each row index. Here is an example.

>>> import numpy as np
>>> np.random.seed(444)

>>> shape = 15, 4
>>> mat = np.random.randint(low=0, high=10, size=shape)
>>> mat
array([[3, 0, 7, 8],
       [3, 4, 7, 6],
       [8, 9, 2, 2],
       [2, 0, 3, 8],
       [0, 6, 6, 0],
       [3, 0, 6, 7],
       [9, 3, 8, 7],
       [3, 2, 6, 9],
       [2, 9, 8, 9],
       [3, 2, 2, 8],
       [1, 5, 6, 7],
       [6, 0, 0, 0],
       [0, 4, 8, 1],
       [9, 8, 5, 8],
       [9, 4, 6, 6]])

# The thresholds for x, y, z, respectively
>>> lower = np.array([5, 5, 4])
>>> upper = np.array([6, 6, 7])
>>> idx = len(lower)
# Parentheses are required here.  NumPy boolean ops use | and &
# which have different operator precedence than `or` and `and`
>>> mask = np.all((mat[:, :idx] < lower) | (mat[:, :idx] > upper), axis=1)

>>> mask
array([False, False,  True,  True, False, False,  True, False,  True,
        True, False, False,  True, False, False])

Now indexing mat by mask will constrain it to row indices where mask is True:

>>> mat[mask]
array([[8, 9, 2, 2],
       [2, 0, 3, 8],
       [9, 3, 8, 7],
       [2, 9, 8, 9],
       [3, 2, 2, 8],
       [0, 4, 8, 1]])

What is a bit different about this approach is that it scales to any number of coordinates: instead of writing out each coordinate condition individually, you specify the thresholds in two arrays, one for the lower bounds and one for the upper bounds, and let NumPy's vectorization and broadcasting build the mask.

np.all(..., axis=1) tests whether every value in a row is True. It captures the "and" conditions from your question, while the | operator captures the "or".
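As a rough sketch, the same idea can be wrapped in a small function and checked against the column-by-column version. The name select_outside and the four-row test array are made up for illustration; they are not part of NumPy or of your code.

```python
import numpy as np

def select_outside(mat, lower, upper):
    """Return rows whose first len(lower) columns all lie outside [lower, upper].

    Hypothetical helper for illustration; not a NumPy function.
    """
    lower = np.asarray(lower)
    upper = np.asarray(upper)
    k = len(lower)
    # (n, k) compared against (k,) broadcasts each threshold down its column
    outside = (mat[:, :k] < lower) | (mat[:, :k] > upper)
    # A row qualifies only if every one of its k coordinates is outside
    return mat[np.all(outside, axis=1)]

# Small made-up array; thresholds match the example above
mat = np.array([[8, 9, 2, 2],
                [3, 4, 7, 6],
                [0, 6, 6, 0],
                [2, 0, 3, 8]])
lower = [5, 5, 4]
upper = [6, 6, 7]
result = select_outside(mat, lower, upper)

# The same selection written out one column at a time
x, y, z = mat[:, 0], mat[:, 1], mat[:, 2]
explicit = mat[((x < 5) | (x > 6)) & ((y < 5) | (y > 6)) & ((z < 4) | (z > 7))]
assert np.array_equal(result, explicit)
print(result)
```

Adding a fourth coordinate would only mean adding one entry to each threshold list, which is the point of the broadcast form.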

Upvotes: 3

kevinkayaks

Reputation: 2726

I'd just drop the np.where and index with the boolean mask directly:

x,y,z,_ = mat.T
mask = ( ( (x <= min_x) | (x > max_x) ) &
         ( (y <= min_y) | (y > max_y) ) &
         ( (z <= min_z) | (z > max_z) ) ) 
mat_new = mat[mask]
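
To see that the mask and the np.where indices pick the same rows, here is a minimal check. The array and the threshold values are assumptions chosen only to make the result easy to follow.

```python
import numpy as np

mat = np.array([[0.0, 9.0, 1.0, 5.0],
                [3.0, 3.0, 3.0, 6.0],
                [9.0, 0.0, 8.0, 7.0]])
# Illustrative thresholds, assumed for this example
min_x, max_x = 2.0, 4.0
min_y, max_y = 2.0, 4.0
min_z, max_z = 2.0, 4.0

x, y, z, _ = mat.T
mask = (((x <= min_x) | (x > max_x)) &
        ((y <= min_y) | (y > max_y)) &
        ((z <= min_z) | (z > max_z)))

rows = np.where(mask)[0]   # integer indices of the rows where mask is True
by_mask = mat[mask]        # boolean indexing along axis 0 (rows)
by_index = mat[rows]       # integer indexing along axis 0 (rows)
assert np.array_equal(by_mask, by_index)

# Note: mat[:, rows] would select columns 0 and 2, not rows 0 and 2
print(by_mask)
```

Both forms return the same rows; the mask version simply skips the intermediate index array.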

Upvotes: 2

DYZ

Reputation: 57033

Looks good to me. You can make it a bit more compact by comparing the columns to the midrange values:

# The outer parentheses let the expression span several lines
mark = ((np.abs(mat[:,0] - (max_x + min_x) / 2) > (max_x - min_x) / 2) &
        (np.abs(mat[:,1] - (max_y + min_y) / 2) > (max_y - min_y) / 2) &
        (np.abs(mat[:,2] - (max_z + min_z) / 2) > (max_z - min_z) / 2))

Unfortunately, you cannot control the precise boundary conditions (< vs <=) anymore. Also, this is probably the slowest solution, even slower than the original one.
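
A small sketch of the boundary difference, using made-up thresholds on a single column: a value exactly equal to min_x is selected by the explicit `<=` test but not by the midrange form, which is strict on both sides.

```python
import numpy as np

min_x, max_x = 2.0, 4.0   # assumed thresholds for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Explicit comparisons: inclusive at min_x, exclusive at max_x
explicit = (x <= min_x) | (x > max_x)

# Midrange form: distance from the center must exceed the half-width,
# so both boundaries are excluded
midrange = np.abs(x - (max_x + min_x) / 2) > (max_x - min_x) / 2

print(explicit)   # the entry for x == 2.0 is True here
print(midrange)   # the entry for x == 2.0 is False here
```

If the inclusive lower bound matters for your data, stick with the explicit comparisons.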

Upvotes: 3
