Ziggy Eunicien
Ziggy Eunicien

Reputation: 2908

how to selectively filter elements in pandas group

I want to selectively remove elements of a pandas group based on their properties within the group.

Here's an example: remove all elements except the row with the highest value in the 'A' column

>>> dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc'), 'C': list('lmnopqrt')})

>>> dff
   A  B  C
0   0  a  l
1   2  a  m
2   4  b  n
3   1  b  o
4   9  b  p
5   2  b  q
6   3  c  r
7  10  c  t

>>> grped = dff.groupby('B')
>>> grped.groups
{'a': [0, 1], 'c': [6, 7], 'b': [2, 3, 4, 5]}

apply custom function/method to the groups (sort within group on col 'A', filter elements).

>>> yourGenius(grped,'A').reset_index()

returns dataframe:

   A  B  C
0   2  a  m
1   9  b  p
2  10  c  t

maybe there is a compact way to do this with a lambda function or .filter()? thanks

Upvotes: 2

Views: 2127

Answers (1)

unutbu
unutbu

Reputation: 879799

If you want to select one row per group, you could use groupby/agg to return index values and select the rows using loc.

For example, to group by B and then select the row with the highest A value:

In [171]: dff
Out[171]: 
    A  B  C
0   0  a  l
1   2  a  m
2   4  b  n
3   1  b  o
4   9  b  p
5   2  b  q
6   3  c  r
7  10  c  t

[8 rows x 3 columns]

In [172]: dff.loc[dff.groupby('B')['A'].idxmax()]
Out[172]: 
    A  B  C
1   2  a  m
4   9  b  p
7  10  c  t

another option (suggested by jezrael) which in practice is faster for a wide range of DataFrames is

dff.sort_values(by=['A'], ascending=False).drop_duplicates('B')

If you wish to select many rows per group, you could use groupby/apply with a function that returns sub-DataFrames for each group. apply will then try to merge these sub-DataFrames for you.

For example, to select every row except the last from each group:

In [216]: df = pd.DataFrame(np.arange(15).reshape(5,3), columns=list('ABC'), index=list('vwxyz')); df['A'] %= 2; df
Out[216]: 
   A   B   C
v  0   1   2
w  1   4   5
x  0   7   8
y  1  10  11
z  0  13  14


In [217]: df.groupby(['A']).apply(lambda grp: grp.iloc[:-1]).reset_index(drop=True, level=0)
Out[217]: 
   A  B  C
v  0  1  2
x  0  7  8
w  1  4  5

Another way is to use groupby/apply to return a Series of index values. Again apply will try to join the Series into one Series. You could then use df.loc to select rows by index value:

In [218]: df.loc[df.groupby(['A']).apply(lambda grp: pd.Series(grp.index[:-1]))]
Out[218]: 
   A  B  C
v  0  1  2
x  0  7  8
w  1  4  5

I don't think groupby/filter will do what you wish, since groupby/filter filters whole groups. It doesn't allow you to select particular rows from each group.

Upvotes: 6

Related Questions