Reputation: 57
I have a scenario where I have the following data
A B C D E F
a il ilp kjh 2020 1
a il ilp kjh 2021 2
a il ilp kjh 2021 3
b kl ilp kjh 2020 1
b kl hjk operio 2020 1
here I am trying to remove the duplicate rows based on col A
C
but keep max value based on col F
here is what I have done
df=pd.DataFrame(data)
df['E'] = pd.to_datetime(df['E'], unit='ms').dt.date #this I would need to do as original data is in epoch format
df = df.groupby(['A', 'C']).max()['F'].reset_index()
This gives me the correct results but other columns (D, E) are dropped in the CSV output i would like to keep them too accordingly but I do not wish to remove duplicates based on those two (D, E) columns.
desired output :
A B C D E F
a il ilp kjh 2021 3
b kl ilp oper 2020 1
b kl hjk operio 2020 1
Any help would be great
Upvotes: 3
Views: 663
Reputation: 195543
You can first sort values by F
and then drop duplicates keeping only last duplicate:
df = df.sort_values(by="F")
df = df.drop_duplicates(["A", "C"], keep="last")
print(df)
Prints:
A B C D E F
3 b kl ilp kjh 2020 1
4 b kl hjk operio 2020 1
2 a il ilp kjh 2021 3
Upvotes: 3