code base 5000

Reputation: 4102

Grouping Numpy Array and Returning Minimum Values

I have an ndarray like this:

import numpy as np

data = [(1, "YES", 54.234),
        (1, "YES", 1.0001),
        (2, "YES", 4.234),
        (3, "YES", 0.234)]
# np.int was removed in NumPy 1.24; the builtin int is equivalent
dtypes = [("GROUPID", int),
          ("HASNEAR", "|S255"),
          ("DISTANCE", np.float64)]
array = np.array(data, dtype=dtypes)

Is there a way to group the data and return only the minimum distance in each group in a new array?

In my example, I have 4 rows. After grouping and taking the minimum, I would expect only 3 rows to be returned, one for each GROUPID value.

If numpy arrays aren't the right tool, could you do this in Pandas?

Thank you

Upvotes: 1

Views: 739

Answers (3)

Eelco Hoogendoorn

Reputation: 10759

As illustrated by others, you can do this in pandas, but it is a relatively heavyweight abstraction that introduces all kinds of other complexities that you may or may not be interested in.

The numpy_indexed package specializes in these kinds of operations in isolation:

import numpy_indexed as npi
# returns a tuple: (unique GROUPID values, minimum DISTANCE per group)
npi.group_by(array['GROUPID']).min(array['DISTANCE'])
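If you would rather not add a dependency, the same grouping can be done with plain NumPy. This is only a minimal sketch, assuming the structured array from the question: sort by GROUPID with DISTANCE as the tie-breaker, then keep the first row of each group. The names `order`, `first` and `result` are just illustrative.

```python
import numpy as np

data = [(1, "YES", 54.234),
        (1, "YES", 1.0001),
        (2, "YES", 4.234),
        (3, "YES", 0.234)]
dtypes = [("GROUPID", int),
          ("HASNEAR", "|S255"),
          ("DISTANCE", np.float64)]
array = np.array(data, dtype=dtypes)

# lexsort uses the LAST key as the primary key:
# sort by GROUPID first, then by DISTANCE within each group
order = np.lexsort((array["DISTANCE"], array["GROUPID"]))
sorted_array = array[order]

# index of the first row of each group in the sorted array,
# which is the row with the smallest DISTANCE for that group
_, first = np.unique(sorted_array["GROUPID"], return_index=True)

# structured array with one full row per GROUPID
result = sorted_array[first]
print(result)
```

Unlike the `min` reduction, this keeps the whole row (including HASNEAR) that holds the minimum distance.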

Upvotes: 2

acidtobi

Reputation: 1365

Create a pandas DataFrame, group by GROUPID and aggregate by min():

import pandas as pd

df = pd.DataFrame(data, columns=('GROUPID','HASNEAR','DISTANCE'))
df.groupby('GROUPID').min()
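One thing to be aware of: calling `min()` on the whole grouped frame takes the minimum of every column independently, so with mixed HASNEAR values the result may not correspond to a single original row. If only the distance matters, a sketch along these lines (assuming the `data` list from the question; `min_dist` is an illustrative name) restricts the aggregation to that column:

```python
import pandas as pd

data = [(1, "YES", 54.234),
        (1, "YES", 1.0001),
        (2, "YES", 4.234),
        (3, "YES", 0.234)]
df = pd.DataFrame(data, columns=["GROUPID", "HASNEAR", "DISTANCE"])

# aggregate only DISTANCE; as_index=False keeps GROUPID as a regular column
min_dist = df.groupby("GROUPID", as_index=False)["DISTANCE"].min()
print(min_dist)
```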

Upvotes: 2

EdChum

Reputation: 393933

If I understand correctly, you can do this in pandas:

In [8]:
import pandas as pd
# construct a df
df = pd.DataFrame(array)
df

Out[8]:
   GROUPID HASNEAR  DISTANCE
0        1  b'YES'   54.2340
1        1  b'YES'    1.0001
2        2  b'YES'    4.2340
3        3  b'YES'    0.2340

You can now group by the GROUPID column, call idxmin to get the index label of the minimum value in the column of interest, and use that to filter the original df:

In [9]:
df.loc[df.groupby('GROUPID')['DISTANCE'].idxmin()]

Out[9]:
   GROUPID HASNEAR  DISTANCE
1        1  b'YES'    1.0001
2        2  b'YES'    4.2340
3        3  b'YES'    0.2340

You can see that what idxmin returns is the index labels of the minimum values:

In [10]:
df.groupby('GROUPID')['DISTANCE'].idxmin()

Out[10]:
GROUPID
1    1
2    2
3    3
Name: DISTANCE, dtype: int64

You can convert back to a numpy array by calling .values:

In [11]:
df.loc[df.groupby('GROUPID')['DISTANCE'].idxmin()].values

Out[11]:
array([[1, b'YES', 1.0001],
       [2, b'YES', 4.234],
       [3, b'YES', 0.234]], dtype=object)
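Note that `.values` on a mixed-type frame gives a 2-D object array, so the field names and per-column dtypes are lost. If you want something closer to the original structured array, `DataFrame.to_records` is one option. A minimal sketch, assuming the same `array` as in the question (`rec` is an illustrative name):

```python
import numpy as np
import pandas as pd

data = [(1, "YES", 54.234),
        (1, "YES", 1.0001),
        (2, "YES", 4.234),
        (3, "YES", 0.234)]
dtypes = [("GROUPID", int),
          ("HASNEAR", "|S255"),
          ("DISTANCE", np.float64)]
array = np.array(data, dtype=dtypes)

df = pd.DataFrame(array)
closest = df.loc[df.groupby("GROUPID")["DISTANCE"].idxmin()]

# record array with named fields; index=False drops the pandas index
rec = closest.to_records(index=False)
print(rec.dtype.names)
print(rec["DISTANCE"])
```

The HASNEAR field will typically come back as an object column rather than `|S255`, so cast it explicitly if the exact dtype matters.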

Upvotes: 1
