Heinz

Reputation: 2467

Compare elements in a NumPy array 3 rows at a time

I have a NumPy array like the one below:

[[3.4, 87]
 [5.5, 11]
 [22, 3]
 [4, 9.8]
 [41, 11.22]
 [32, 7.6]]

and I want to:

  1. compare the elements in column 2, 3 rows at a time
  2. delete the row with the biggest value in column 2, 3 rows at a time

For example, in the first 3 rows, the values in column 2 are 87, 11 and 3, and I would like to keep the rows containing 11 and 3.

The output numpy array I expected would be:

[[5.5, 11]
 [22, 3]
 [4, 9.8]
 [32, 7.6]]

I am new to NumPy arrays; any advice on how to achieve this would be appreciated.

Upvotes: 0

Views: 302

Answers (1)

unutbu

Reputation: 879291

import numpy as np
x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

y = x.reshape(-1,3,2)
idx = y[..., 1].argmax(axis=1)
mask = np.arange(3)[None, :] != idx[:, None]
y = y[mask]
print(y)
# This might be helpful for the deleted part of your question
# y = y.reshape(-1,2,2)
# z = y[...,1]/y[...,1].sum(axis=1, keepdims=True)  # keepdims so each group is divided by its own sum
# result = np.dstack([y, z[...,None]])

yields

[[  5.5  11. ]
 [ 22.    3. ]
 [  4.    9.8]
 [ 32.    7.6]]

"Grouping by three" with NumPy can be done by reshaping the array to create a new axis of length 3 -- provided the original number of rows is divisible by 3:

In [92]: y = x.reshape(-1,3,2); y
Out[92]: 
array([[[  3.4 ,  87.  ],
        [  5.5 ,  11.  ],
        [ 22.  ,   3.  ]],

       [[  4.  ,   9.8 ],
        [ 41.  ,  11.22],
        [ 32.  ,   7.6 ]]])

In [93]: y.shape
Out[93]: (2, 3, 2)  
          |  |  |
          |  |  o--- 2 columns in each group
          |  o------ 3 rows in each group
          o--------- 2 groups
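
If the number of rows is not divisible by 3, the reshape raises a ValueError. One possible way to handle that, sketched here under the assumption that trailing rows which do not fill a whole group should simply be set aside, is to reshape only the full groups. The helper name split_full_groups and the sample array x7 are made up for illustration:

```python
import numpy as np

def split_full_groups(arr, group_size=3):
    """Return (grouped, leftover): the full groups as a 3D array, plus any trailing rows."""
    n_full = (arr.shape[0] // group_size) * group_size  # rows that fit into whole groups
    grouped = arr[:n_full].reshape(-1, group_size, arr.shape[1])
    leftover = arr[n_full:]  # assumption: these rows are left for the caller to decide on
    return grouped, leftover

x7 = np.arange(14, dtype=float).reshape(7, 2)  # 7 rows: not divisible by 3
grouped, leftover = split_full_groups(x7)
print(grouped.shape)   # (2, 3, 2)
print(leftover.shape)  # (1, 2)
```

Whether the leftover rows should be kept, dropped, or treated as a smaller group depends on what you need; the sketch only separates them.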

For each group, we can select the second column and find the row with the maximum value:

In [94]: idx = y[..., 1].argmax(axis=1); idx
Out[94]: array([0, 1])

array([0, 1]) indicates that in the first group the row at index 0 contains the maximum (i.e. 87), and in the second group the row at index 1 contains the maximum (i.e. 11.22).
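
One detail worth knowing: if two rows in a group tie for the maximum, argmax reports only the first of them, so only that row would end up being removed. A small illustration, using a made-up single group named tied:

```python
import numpy as np

# One group of 3 rows; column 1 has two rows tied at 5.0 (illustrative data)
tied = np.array([[[1.0, 5.0],
                  [2.0, 5.0],
                  [3.0, 1.0]]])

idx_tied = tied[..., 1].argmax(axis=1)
print(idx_tied)  # [0] -- the first occurrence of the maximum
```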

Next, we can generate a 2D boolean selection mask which is True where the rows do not contain the maximum value:

In [95]: mask = np.arange(3)[None, :] != idx[:, None]; mask
Out[95]: 
array([[False,  True,  True],
       [ True, False,  True]], dtype=bool)

In [96]: mask.shape
Out[96]: (2, 3)

mask has shape (2,3). y has shape (2,3,2). If mask is used to index y as in y[mask], then the mask is aligned with the first two axes of y, and all values where mask is True are returned:

In [98]: y[mask]
Out[98]: 
array([[  5.5,  11. ],
       [ 22. ,   3. ],
       [  4. ,   9.8],
       [ 32. ,   7.6]])

In [99]: y[mask].shape
Out[99]: (4, 2)
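
As an alternative to the boolean mask, the per-group positions in idx can be converted into row numbers of the original array and passed to np.delete. This is only a sketch of the same idea in a different form; the names group_size, flat_idx and result are chosen here for illustration:

```python
import numpy as np

x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

group_size = 3
y = x.reshape(-1, group_size, x.shape[1])
idx = y[..., 1].argmax(axis=1)                     # position of the max within each group
flat_idx = idx + group_size * np.arange(len(idx))  # same rows, numbered in the original array
result = np.delete(x, flat_idx, axis=0)            # returns a new array without those rows

print(flat_idx)  # [0 4]
print(result)
```

np.delete returns a copy, so x itself is left unchanged.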

By the way, the same calculation could be done using Pandas like this:

import numpy as np
import pandas as pd
x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

df = pd.DataFrame(x)
idx = df.groupby(df.index // 3)[1].idxmax()
# drop the row with the maximum value in each group
df = df.drop(idx.values, axis=0)

which yields the DataFrame:

      0     1
1   5.5  11.0
2  22.0   3.0
3   4.0   9.8
5  32.0   7.6

You might find Pandas syntax easier to use, but for the above calculation NumPy is faster.
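
If you want to check the speed difference on your own machine, a rough timing sketch along these lines may help. The function names numpy_version and pandas_version, the array size, and the repeat count are all arbitrary choices for illustration, and the timings will vary with your data and library versions:

```python
import timeit

import numpy as np
import pandas as pd

def numpy_version(x, group_size=3):
    y = x.reshape(-1, group_size, x.shape[1])
    idx = y[..., 1].argmax(axis=1)
    mask = np.arange(group_size)[None, :] != idx[:, None]
    return y[mask]

def pandas_version(x, group_size=3):
    df = pd.DataFrame(x)
    idx = df.groupby(df.index // group_size)[1].idxmax()
    return df.drop(idx.values, axis=0).to_numpy()

# Random test data; the row count is divisible by the group size
big = np.random.default_rng(0).random((30000, 2))

# Both approaches should remove the same rows
same = np.array_equal(numpy_version(big), pandas_version(big))
print("same result:", same)

t_np = timeit.timeit(lambda: numpy_version(big), number=20)
t_pd = timeit.timeit(lambda: pandas_version(big), number=20)
print(f"NumPy:  {t_np:.4f} s for 20 runs")
print(f"Pandas: {t_pd:.4f} s for 20 runs")
```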

Upvotes: 1
