gc5

Reputation: 9868

How to filter results of a groupby in pandas

I am trying to filter the result of a groupby.

I have this table:

A       B       C

A0      B0      0.5
A1      B0      0.2
A2      B1      0.6
A3      B1      0.4
A4      B2      1.0
A5      B2      1.2

A is the index and it is unique.

Secondly I have this list:

['A0', 'A1', 'A4']
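
To make this reproducible, the table and the list above can be built roughly like this. It is only a sketch: the names `df` and `preferred` are placeholders of mine, not names from my real code.

```python
import pandas as pd

# The table from the question; A is the (unique) index
df = pd.DataFrame(
    {
        "B": ["B0", "B0", "B1", "B1", "B2", "B2"],
        "C": [0.5, 0.2, 0.6, 0.4, 1.0, 1.2],
    },
    index=pd.Index(["A0", "A1", "A2", "A3", "A4", "A5"], name="A"),
)

# Index labels that take priority within their group
preferred = ["A0", "A1", "A4"]
```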

I want to group by B and, for each group, extract the row with the highest value of C. If any rows of a group have their index in the list above, the row must be chosen from those rows only; otherwise it is chosen from all the rows in the group.

The result for this data and list has to be:

A       B       C

A0      B0      0.5
A2      B1      0.6
A4      B2      1.0

I think the pseudocode would be:

group by B
for each group G:
    intersect the index of group G's rows with the labels in the list
    if the intersection is not empty:
        the group G becomes the intersection
    sort the rows by C in descending order
    take the first row as representative for this group

How can I do it in pandas?

Thanks

Upvotes: 2

Views: 2641

Answers (2)

gc5

Reputation: 9868

I solved it this way:

import pandas as pd

# a is the DataFrame (indexed by A), s is the list of preferred index labels
s = ['A0', 'A1', 'A4']

# sort once so that the first row of every group has the highest C
a_sorted = a.sort_values('C', ascending=False)

# take results for the intersection: best row per group among the rows in s
results_intersection = a_sorted[a_sorted.index.isin(s)].groupby('B').head(1)

# take remaining results: best row for the groups that have no row in s
missing_results_B = set(a['B']) - set(results_intersection['B'])
results_addendum = a_sorted[a_sorted['B'].isin(missing_results_B)].groupby('B').head(1)

# concatenate
results = pd.concat([results_intersection, results_addendum]).sort_values('B')
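
A more compact variant of the same idea, as a sketch rather than a tested drop-in: flag the preferred rows in a helper column, sort so that preferred rows and then high C values come first within each group, and keep the first row per group. The column name `in_list` is a name I made up for illustration, and the frame is rebuilt here only so the snippet runs on its own.

```python
import pandas as pd

# Same data as in the question; A is the index
a = pd.DataFrame(
    {
        "B": ["B0", "B0", "B1", "B1", "B2", "B2"],
        "C": [0.5, 0.2, 0.6, 0.4, 1.0, 1.2],
    },
    index=pd.Index(["A0", "A1", "A2", "A3", "A4", "A5"], name="A"),
)
s = ["A0", "A1", "A4"]

# Within each B: rows listed in s first, then highest C first
ranked = a.assign(in_list=a.index.isin(s)).sort_values(
    ["B", "in_list", "C"], ascending=[True, False, False]
)

# Keep the first row of each group and drop the helper column
results = ranked.drop_duplicates("B").drop(columns="in_list")
print(results)
```

This uses sort_values plus drop_duplicates instead of two groupby passes and a concat, so the result keeps A as the index.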

Hope it helps, and that I did not forget anything.

Upvotes: 1

LondonRob

Reputation: 78733

Here's a general solution. It's not pretty but it works:

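def filtermax(g, filter_on, filter_items, max_over):
    # filter_on is unused; it is kept so the call signature stays the same
    # rows of this group whose index label is in filter_items
    preferred = g[g.index.isin(filter_items)]
    # fall back to the whole group when it has no preferred row
    candidates = g if preferred.empty else preferred
    # keep the row(s) holding the maximum of max_over among the candidates
    return candidates[candidates[max_over] == candidates[max_over].max()]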

which gives:

>>> x.groupby('B').apply(filtermax, 'A', ['A0', 'A1', 'A4'], 'C')
        B    C
B  A          
B0 A0  B0  0.5
B1 A2  B1  0.6
B2 A4  B2  1.0

If anyone can work out how to stop B being added as an index (at least on my system x.groupby('B', as_index=False) doesn't help!) then this solution's pretty much perfect!
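
On the index question: as far as I know, passing group_keys=False to groupby stops apply from prepending the group key to the result's index. Below is a minimal sketch under that assumption. The frame is rebuilt so the snippet runs on its own, and the function is a trimmed variant of the one above, not the exact original. Whether the B column itself appears in the output may depend on the pandas version.

```python
import pandas as pd

# Same data as in the question; A is the index
x = pd.DataFrame(
    {
        "B": ["B0", "B0", "B1", "B1", "B2", "B2"],
        "C": [0.5, 0.2, 0.6, 0.4, 1.0, 1.2],
    },
    index=pd.Index(["A0", "A1", "A2", "A3", "A4", "A5"], name="A"),
)

def filtermax(g, filter_items, max_over):
    # Prefer rows whose index label is listed; otherwise use the whole group
    preferred = g[g.index.isin(filter_items)]
    candidates = g if preferred.empty else preferred
    return candidates[candidates[max_over] == candidates[max_over].max()]

# group_keys=False keeps only the original A labels in the result's index
result = x.groupby("B", group_keys=False).apply(
    filtermax, ["A0", "A1", "A4"], "C"
)
print(result)
```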

Upvotes: 3
