Matt

Reputation: 2329

Conditional filtering in numpy arrays or pandas DataFrame

Assume I have the following data, which can be either a NumPy array or a pandas DataFrame:

array([[4092,    3],
       [4095,    4],
       [4097,    4],
       [4124,    1],
       [4128,    0],
       [4129,    0],
       [4131,    5],
       [4132,    5],
       [4133,    2],
       [4134,    2]], dtype=int64)

I would like to get an array containing the minimum value in each category (2nd column). I could loop over each unique value, perform the min operation, and store the results, but I was wondering whether there is a faster and cleaner way to do it.

The output would look like the following:

array([[4092,    3],
       [4095,    4],
       [4124,    1],
       [4128,    0],
       [4131,    5],
       [4133,    2]], dtype=int64)

Upvotes: 1

Views: 1809

Answers (1)

EdChum

Reputation: 394041

In pandas you can do this by grouping on the category column and calling min() on the first column. Here my df has column names 0 and 1. I then call reset_index to turn the grouped index back into a column. Because the columns now come out in the order 1, 0, I select them by label to restore the column order you want. Note that the rows come back sorted by category rather than in their original order:

In [22]:

result = df.groupby(1)[0].min().reset_index()
result[[0, 1]]
Out[22]:
      0  1
0  4128  0
1  4124  1
2  4133  2
3  4092  3
4  4095  4
5  4131  5
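If the rows need to appear in the same order as in the question's expected output, one possible sketch is to pass sort=False to groupby, which keeps groups in order of first appearance instead of sorting by key. The variable name ordered is only illustrative, and this assumes the same df with column labels 0 and 1:

```python
import numpy as np
import pandas as pd

a = np.array([[4092, 3], [4095, 4], [4097, 4], [4124, 1], [4128, 0],
              [4129, 0], [4131, 5], [4132, 5], [4133, 2], [4134, 2]],
             dtype=np.int64)
df = pd.DataFrame(a)

# sort=False keeps each category in the order it first appears in df,
# rather than sorting the result by the category value.
ordered = df.groupby(1, sort=False)[0].min().reset_index()[[0, 1]]

# Convert back to a NumPy array if that is the form you need.
print(ordered.to_numpy())
```

For this data the category order of first appearance is 3, 4, 1, 0, 5, 2, which is the row order shown in the question.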

The operations above are vectorised, so they will be much faster and scale much better than iterating over each row.
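Since the question also allows a plain NumPy array, here is a minimal sketch of the same idea without pandas. It is not part of the pandas approach described here, just one way to do it: sort by category and then by value with np.lexsort, then take the first row of each category with np.unique. The names order, first, by_category and original_order are illustrative:

```python
import numpy as np

a = np.array([[4092, 3], [4095, 4], [4097, 4], [4124, 1], [4128, 0],
              [4129, 0], [4131, 5], [4132, 5], [4133, 2], [4134, 2]],
             dtype=np.int64)

# lexsort uses the LAST key as the primary key:
# sort by category (column 1), breaking ties by value (column 0).
order = np.lexsort((a[:, 0], a[:, 1]))
sorted_a = a[order]

# After sorting, the first row of each category holds its minimum value.
_, first = np.unique(sorted_a[:, 1], return_index=True)

# Rows ordered by category value.
by_category = sorted_a[first]

# Map back to the original row positions and sort them
# to keep the rows in the order they had in a.
original_order = a[np.sort(order[first])]

print(by_category)
print(original_order)
```

Here by_category matches the row order of the pandas result, while original_order matches the output requested in the question.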

I created the DataFrame using the following code:

In [4]:

import numpy as np
a = np.array([[4092,    3],
       [4095,    4],
       [4097,    4],
       [4124,    1],
       [4128,    0],
       [4129,    0],
       [4131,    5],
       [4132,    5],
       [4133,    2],
       [4134,    2]], dtype=np.int64)
a
Out[4]:
array([[4092,    3],
       [4095,    4],
       [4097,    4],
       [4124,    1],
       [4128,    0],
       [4129,    0],
       [4131,    5],
       [4132,    5],
       [4133,    2],
       [4134,    2]], dtype=int64)

In [23]:

import pandas as pd
df = pd.DataFrame(a)
df
Out[23]:
      0  1
0  4092  3
1  4095  4
2  4097  4
3  4124  1
4  4128  0
5  4129  0
6  4131  5
7  4132  5
8  4133  2
9  4134  2

Upvotes: 3
