Reputation: 14093
I need to groupby and filter out duplicates in a pandas dataframe based on conditions. My dataframe looks like this:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,2,2,3,4,4],'Date':['1/1/2001','1/1/1999','1/1/2010','1/1/2004','1/1/2000','1/1/2001','1/1/2000'], 'type':['yes','yes','yes','yes','no','no','no'], 'source':[3,1,1,2,2,2,1]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('ID')
df
Date source type
ID
1 2001-01-01 3 yes
1 1999-01-01 1 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 2 no
4 2000-01-01 1 no
I need to group by ID and type and, wherever type == 'yes', keep only the most recent record if it also has the highest source. If the most recent record does not have the highest source, keep both records. Desired output:
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 2 no
4 2000-01-01 1 no
I have tried using transform but cannot figure out how to apply conditions:
grouped = df.groupby(['ID','type'])['Date'].transform('max')
df = df.loc[df['Date'] == grouped]
df
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
3 2000-01-01 2 no
4 2001-01-01 2 no
Any help is greatly appreciated.
@Wen, here is the problem: with a larger dataframe (mine has about 70 columns and 5000 rows), your code does not take the maximum source into account. For example, with this data:
Date source type
ID
1 2001-01-01 3 yes
1 1999-01-01 1 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 1 yes
4 2000-01-01 2 yes
Using your code I get:
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 1 yes
It should be:
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 1 yes
4 2000-01-01 2 yes
Upvotes: 0
Views: 97
Reputation: 323226
This needs pd.concat:
# Changed this line: it seems you need to group by type only
grouped = df.groupby(['type'])['Date'].transform('max')
s = df.loc[df['Date'] == grouped].index
# Split df into two parts: IDs not in s keep only their latest row, IDs in s keep all rows
pd.concat([df.loc[df.index.difference(s)].sort_values('Date').groupby('ID').tail(1), df.loc[s]]).sort_index()
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 2 no
4 2000-01-01 1 no
Update
grouped = df.groupby(['type'])['source'].transform('max')
s = df.loc[df['source'] == grouped].index
pd.concat([df.loc[s].sort_values('Date').groupby('ID').tail(1),df.loc[df.index.difference(s)]]).sort_index()
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 1 yes
4 2000-01-01 2 yes
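Both snippets above compare against a maximum taken over each type as a whole, so they may depend on how the values happen to be distributed across IDs. As an alternative, here is a minimal sketch that applies the rule from the question inside each (ID, type) group using boolean masks. The helper name keep_latest_if_top_source is made up for illustration, and the sketch assumes ID is the index (as in the question) and that rows tied on the latest date are all kept.

```python
import pandas as pd


def keep_latest_if_top_source(df):
    """Hypothetical helper: apply the question's rule per (ID, type) group."""
    flat = df.reset_index()  # assumes 'ID' is the index, as in the question
    keys = [flat['ID'], flat['type']]

    # Row-level flags, each computed within the row's own (ID, type) group
    is_latest = flat['Date'] == flat['Date'].groupby(keys).transform('max')
    is_top = flat['source'] == flat['source'].groupby(keys).transform('max')

    # Group-level flag: the most recent row also carries the highest source
    latest_is_top = (is_latest & is_top).groupby(keys).transform('any')

    # Only 'yes' groups whose latest row is also the top source are collapsed
    collapse = (flat['type'] == 'yes') & latest_is_top

    # Collapsed groups keep just the latest row; every other group keeps all rows
    return flat[~collapse | is_latest].set_index('ID')


# The second example dataframe from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 4, 4],
    'Date': ['1/1/2001', '1/1/1999', '1/1/2010', '1/1/2004',
             '1/1/2000', '1/1/2001', '1/1/2000'],
    'type': ['yes', 'yes', 'yes', 'yes', 'no', 'yes', 'yes'],
    'source': [3, 1, 1, 2, 2, 1, 2],
})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('ID')

result = keep_latest_if_top_source(df)
print(result)
```

On this data only the 1999 row of ID 1 is dropped, which matches the "It should be" table in the question. Because the masks are computed per group, the same function can be reused on the first example dataframe without switching between Date and source.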
Upvotes: 2