Reputation: 2329
Assume I have the following data, which can be either a numpy array or a pandas DataFrame:
array([[4092, 3],
[4095, 4],
[4097, 4],
[4124, 1],
[4128, 0],
[4129, 0],
[4131, 5],
[4132, 5],
[4133, 2],
[4134, 2]], dtype=int64)
I would like to get an array containing the minimum value for each category (2nd column). I could loop over the unique values, perform the min operation for each one, and store the results, but I was wondering whether there is a faster and cleaner way to do it.
The output would look like the following:
array([[4092, 3],
[4095, 4],
[4124, 1],
[4128, 0],
[4131, 5],
[4133, 2]], dtype=int64)
Upvotes: 1
Views: 1809
Reputation: 394041
In pandas you can do this by performing a groupby on the category column and then calling min() on the first column. Here my df has the column names 0 and 1. I then call reset_index to restore the grouped index as a column. Because that puts the category column first, I select the columns by label to put them back in the original column order:
In [22]:
result = df.groupby(1)[0].min().reset_index()
result[[0, 1]]
Out[22]:
      0  1
0  4128  0
1  4124  1
2  4133  2
3  4092  3
4  4095  4
5  4131  5
The methods above are vectorised, so they will be much faster and scale much better than iterating over each row.
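Note that the grouped result is sorted by category, whereas the expected output in the question keeps the rows in their original order. If that order matters, one possible approach is sketched below. It uses idxmin() to find the row label of each group's minimum and then sorts by label. The variable names min_rows and ordered are only illustrative, and the sketch assumes the default integer index.

```python
import numpy as np
import pandas as pd

a = np.array([[4092, 3], [4095, 4], [4097, 4], [4124, 1], [4128, 0],
              [4129, 0], [4131, 5], [4132, 5], [4133, 2], [4134, 2]],
             dtype=np.int64)
df = pd.DataFrame(a)

# For each category in column 1, get the row label of the smallest value in column 0.
min_rows = df.groupby(1)[0].idxmin()

# Select those rows, then sort by row label to restore the original row order.
ordered = df.loc[min_rows.to_numpy()].sort_index()
print(ordered)
```

Calling ordered.to_numpy() converts the result back to a plain array if that is what you need.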
I created the dataframe using the following code:
In [4]:
import numpy as np
a = np.array([[4092, 3],
[4095, 4],
[4097, 4],
[4124, 1],
[4128, 0],
[4129, 0],
[4131, 5],
[4132, 5],
[4133, 2],
[4134, 2]], dtype=np.int64)
a
Out[4]:
array([[4092, 3],
[4095, 4],
[4097, 4],
[4124, 1],
[4128, 0],
[4129, 0],
[4131, 5],
[4132, 5],
[4133, 2],
[4134, 2]], dtype=int64)
In [23]:
import pandas as pd
df = pd.DataFrame(a)
df
Out[23]:
      0  1
0  4092  3
1  4095  4
2  4097  4
3  4124  1
4  4128  0
5  4129  0
6  4131  5
7  4132  5
8  4133  2
9  4134  2
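If you would rather stay in NumPy, the same idea can be sketched without pandas. This is not the method used above, just one possible alternative: sort the rows by category and then by value, and take the first row of each category. The names order, first, mins_by_category and mins_original_order are illustrative.

```python
import numpy as np

a = np.array([[4092, 3], [4095, 4], [4097, 4], [4124, 1], [4128, 0],
              [4129, 0], [4131, 5], [4132, 5], [4133, 2], [4134, 2]],
             dtype=np.int64)

# lexsort uses the LAST key as the primary key:
# sort by category (column 1), breaking ties by value (column 0).
order = np.lexsort((a[:, 0], a[:, 1]))
sorted_a = a[order]

# The first occurrence of each category in the sorted array is its minimum row.
_, first = np.unique(sorted_a[:, 1], return_index=True)
mins_by_category = sorted_a[first]

# Map back to the original row positions to keep the input order instead.
mins_original_order = a[np.sort(order[first])]
print(mins_by_category)
print(mins_original_order)
```

Here mins_by_category is ordered by category, like the groupby result, while mins_original_order keeps the rows in the order they appear in the input.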
Upvotes: 3