Rebecca

Reputation: 57

Removing duplicates based on columns while keeping the row with the max value - pandas

I have a scenario where I have the following data:

A   B   C    D       E      F
a   il  ilp kjh     2020    1
a   il  ilp kjh     2021    2
a   il  ilp kjh     2021    3
b   kl  ilp kjh     2020    1
b   kl  hjk operio  2020    1

Here I am trying to remove duplicate rows based on columns A and C, keeping the row with the max value in column F. Here is what I have done:

import pandas as pd

df = pd.DataFrame(data)
df['E'] = pd.to_datetime(df['E'], unit='ms').dt.date  # needed because the original data is in epoch (ms) format
df = df.groupby(['A', 'C']).max()['F'].reset_index()

This gives me the correct result, but the other columns (B, D, E) are dropped in the CSV output. I would like to keep them as well; I just do not wish to remove duplicates based on those columns.

Desired output:

A   B   C   D      E    F
a   il  ilp kjh    2021 3
b   kl  ilp kjh    2020 1
b   kl  hjk operio 2020 1

Any help would be great.

Upvotes: 3

Views: 663

Answers (1)

Andrej Kesely

Reputation: 195543

You can first sort the values by F and then drop duplicates, keeping only the last occurrence in each (A, C) group:

df = df.sort_values(by="F")
df = df.drop_duplicates(["A", "C"], keep="last")
print(df)

Prints:

   A   B    C       D     E  F
3  b  kl  ilp     kjh  2020  1
4  b  kl  hjk  operio  2020  1
2  a  il  ilp     kjh  2021  3
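As an alternative sketch (not from the answer above), the same result can be obtained without sorting by using groupby with idxmax, which returns the index of the row with the maximum F per (A, C) group; selecting those indices with df.loc keeps every column intact. The DataFrame literal below is reconstructed from the question's sample data:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "A": ["a", "a", "a", "b", "b"],
    "B": ["il", "il", "il", "kl", "kl"],
    "C": ["ilp", "ilp", "ilp", "ilp", "hjk"],
    "D": ["kjh", "kjh", "kjh", "kjh", "operio"],
    "E": [2020, 2021, 2021, 2020, 2020],
    "F": [1, 2, 3, 1, 1],
})

# idxmax gives the index label of the max-F row per (A, C) group;
# df.loc then selects those full rows, so columns B, D, E survive.
out = df.loc[df.groupby(["A", "C"])["F"].idxmax()]
print(out)
```

Note that when several rows in a group tie on F, idxmax keeps the first of them, whereas sort + drop_duplicates(keep="last") keeps the last; with this data both approaches select the same three rows.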

Upvotes: 3
