learner
learner

Reputation: 2752

How to get back the index after groupby in pandas

I am trying to find the the record with maximum value from the first record in each group after groupby and delete the same from the original dataframe.

import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'], 
                   'cost': [1, 2, 1, 1, 3, 1, 5]})
print df 
t = df.groupby('item_id').first() #lost track of the index
desired_row = t[t.cost == t.cost.max()]
#delete this row from df

         cost
item_id      
d           5

I need to keep track of desired_row and delete this row from df and repeat the process.

What is the best way to find and delete the desired_row?

Upvotes: 9

Views: 12331

Answers (4)

Overview: Create a dataframe using an dictionary. Group by item_id and find the max value. enumerate over the grouped dataframe and use the key which is an numeric value to return the alpha index value. Create an result_df dataframe if you desire.

   df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'], 
               'cost': [1, 2, 1, 1, 3, 1, 5]})

   grouped=df_temp.groupby(['item_id'])['cost'].max()

   result_df=pd.DataFrame(columns=['item_id','cost'])

   for key, value in enumerate(grouped):
     index=grouped.index[key]
     result_df=result_df.append({'item_id':index,'cost':value},ignore_index=True)

   print(result_df.head(5))

Upvotes: 0

BENY
BENY

Reputation: 323396

Try this ?

import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})
t=df.drop_duplicates(subset=['item_id'],keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]

Out[186]: 
   cost item_id
0     1       a
1     2       a
2     1       b
3     1       b
4     3       b
5     1       c

Upvotes: 2

Alexander
Alexander

Reputation: 109756

I am not sure of a general way, but this will work in your case since you are taking the first item of each group (it would also easily work on the last). In fact, because of the general nature of split-aggregate-combine, I don't think this is easily achievable without doing it yourself.

gb = df.groupby('item_id', as_index=False)
>>> gb.groups  # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}

# Get the first index location from each group using a dictionary comprehension.
subset = {k: v[0] for k, v in gb.groups.iteritems()}
df2 = df.iloc[subset.values()]
# These are the first items in each groupby.
>>> df2
   cost item_id
0     1       a
5     1       c
2     1       b
6     5       d

# Exclude any items from above where the cost is equal to the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
   cost item_id
0     1       a
1     2       a
2     1       b
3     1       b
4     3       b
5     1       c

Upvotes: 5

Vaishali
Vaishali

Reputation: 38425

Or using not in

Consider this df with few more rows

pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd','d'], 
               'cost': [1, 2, 1, 1, 3, 1, 5,1,7]})

df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]

    cost    item_id
0   1       a
1   2       a
2   1       b
3   1       b
4   3       b
5   1       c
7   1       d
8   7       d

Upvotes: 1

Related Questions