drop_duplicates on a specific column, keeping the row with the latest value?

Question

Is there a way to customize drop_duplicates so that it also drops "kind of" duplicates, meaning rows that share the same ID but differ in other columns?

Example: pandas df

Year	Name	ID	City
2011	Superman	101	Metropolis
2011	Batman	102	Gotham
2012	The Batman	102	Gotham
2011	Noobmaster69	103	Online
2011	Noobmaster69	103	Online

I tried using drop_duplicates and got this:

Year	Name	ID	City
2011	Superman	101	Metropolis
2011	Batman	102	Gotham
2012	The Batman	102	Gotham
2011	Noobmaster69	103	Online

I actually want to squeeze it even more: for ID 102, I want to keep only the "The Batman" row, because it holds the newer info (2012 > 2011). I'm expecting something like this:

Year	Name	ID	City
2011	Superman	101	Metropolis
2012	The Batman	102	Gotham
2011	Noobmaster69	103	Online

Siva Reddy · Accepted Answer

Try this. The duplicates can easily be dropped using the ID column.

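import pandas as pd

# Read your table data; read_csv already returns a DataFrame
df = pd.read_csv("your_filename.csv")

# Sort by Year so the newest row for each ID comes last, keep that row,
# then restore the original row order
df = df.sort_values('Year').drop_duplicates(subset='ID', keep='last').sort_index()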

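subset='ID' tells drop_duplicates to compare rows on the ID column only, and keep='last' keeps the last row of each group of duplicates and drops the earlier ones. Note that "last" means last in row order, so this returns the newest entry only when the rows for each ID are ordered by Year.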
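As a quick check, here is a minimal sketch that rebuilds the table from the question in memory instead of reading a CSV. The variable names (`latest`, `latest_alt`) are only illustrative. It assumes "latest" means the highest Year per ID, and it sorts by Year first so that keep='last' does not depend on the order of the input rows. The groupby/idxmax line is an alternative way to get the same rows, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Year": [2011, 2011, 2012, 2011, 2011],
        "Name": ["Superman", "Batman", "The Batman", "Noobmaster69", "Noobmaster69"],
        "ID": [101, 102, 102, 103, 103],
        "City": ["Metropolis", "Gotham", "Gotham", "Online", "Online"],
    }
)

# Sort by Year (stable sort) so the newest row of each ID comes last,
# keep that row, then restore the original row order via the index.
latest = (
    df.sort_values("Year", kind="mergesort")
      .drop_duplicates(subset="ID", keep="last")
      .sort_index()
)

# Alternative: pick, for each ID, the index label of the row with the largest Year.
latest_alt = df.loc[df.groupby("ID")["Year"].idxmax()]

print(latest)
```

Both approaches leave one row per ID. They can differ in which of two fully identical rows survives (here the two Noobmaster69 rows), which does not matter when the rows are exact duplicates.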
