Reputation: 661
Below is a subset of a pandas
dataframe
I have and I am trying to remove multiple rows based on some conditions.
code1 code2 grp1 grp2 dist_km
0 M001 M002 AAA AAA 112
1 M001 M003 AAA IHH 275
2 M002 M005 AAA XXY 150
3 M002 M004 AAA AAA 65
4 M003 M443 IHH GRR 50
5 M003 M667 IHH IHH 647
6 M003 M664 IHH FFG 336
So I would only like to keep the rows where grp1
is the same as grp2
for each code1
but only where dist_km
is the smallest value for that specific code1
.
For the example above, only these rows will remain:
code1 code2 grp1 grp2 dist_km
0 M001 M002 AAA AAA 112
3 M002 M004 AAA AAA 65
What would be the easiest way to do this?
Upvotes: 4
Views: 357
Reputation: 18208
If creating temporary dataframe is not issue then, you can try using transform
:
tmp = df[df.groupby('code1')['dist_km'].transform('min') == df['dist_km']]
df1 = tmp[tmp['grp1'] == tmp['grp2']]
Or you can also try:
new_df = df.loc[df.groupby('code1')['dist_km'].idxmin()][df['grp1']==df['grp2']]
Upvotes: 2
Reputation: 36
You can do this by filtering your dataframe, applying a groupby/agg and then merge back.
result_df = df.loc[df.grp1 == df.grp2].groupby('code1').agg({'dist_km': min})
df = pd.merge(df, result_df, how='inner',
left_on=['code1', 'dist_km'], right_on=['code1', 'dist_km'])
Upvotes: 0
Reputation: 323226
No need groupby
using sort_values
with drop_duplicates
df.sort_values('dist_km').drop_duplicates('code1').query('grp1==grp2')
code1 code2 grp1 grp2 dist_km
3 M002 M004 AAA AAA 65
0 M001 M002 AAA AAA 112
Upvotes: 5
Reputation: 38415
Use two conditions
df.loc[(df['dist_km'] == df.groupby('code1')['dist_km'].transform('min')) & (df['grp1'] == df['grp2'])]
code1 code2 grp1 grp2 dist_km
0 M001 M002 AAA AAA 112
3 M002 M004 AAA AAA 65
Upvotes: 4
Reputation: 2854
This is one way that this could work by chaining a bunch of conditions. I've commented them all to make it clear at each step (the order matters):
codes = df.code1.unique() # gets unique codes
splitdfs = []
for code in codes:
tempdf = df[df.code1 == code] # select all code1
tempdf = tempdf[tempdf.dist_km == tempdf.dist_km.min()] # select dist_km is min
tempdf = tempdf[tempdf.grp1 == tempdf.grp2] # select grp1 == grp2 (must be AFTER selecting lowest dist_km)
splitdfs.append(tempdf)
selectdf = pd.concat(splitdfs)
Upvotes: 0