Reputation: 140
I have a big dataframe with many duplicates in it. I want to keep the first and last entry of each duplicate but drop every duplicate in between.
I've already tried to get this done by using df.drop_duplicates with the parameters 'first' and 'last' to get two dataframes and then merge them again to one df so I have the first and last entry, but that didn't work.
df_first = df
df_last = df
df_first['Path'].drop_duplicates(keep='first', inplace=True)
df_last['Path'].drop_duplicates(keep='last', inplace=True)
Thanks for your help in advance!
Upvotes: 3
Views: 1692
Reputation: 1
**Using group by.nth which is an Updated code from previous solution to get nth entry
def keep_second_dup(duplicate):
duplicate[Columnname]=duplicate[Columnname'].value_counts()
second_duplicate=duplicate[duplicate['Count']>=1]
residual=duplicate[duplicate['Count']==1]
sec=second_duplicated.groupby([Columnname]).nth([1]).reset_index()
final_data=pd.concat([sec,residual])
final_data.drop('Count',axis=1,inplace=True)
return final_data
Upvotes: 0
Reputation: 863216
Use GroupBy.nth
for avoid duplicates if group with length is 1
:
df = pd.DataFrame({
'a':[5,3,6,9,2,4],
'Path':list('aaabbc')
})
print(df)
a Path
0 5 a
1 3 a
2 6 a
3 9 b
4 2 b
5 4 c
df = df.groupby('Path').nth([0, -1])
print (df)
a
Path
a 5
a 6
b 9
b 2
c 4
Upvotes: 4