Reputation: 10383
I have a data frame of about 52000 rows with some duplicates. When I use
df.drop_duplicates()
I lose about 1000 rows, but I don't want to erase these rows; I want to know which rows are the duplicates.
Upvotes: 2
Views: 14347
Reputation: 31662
You could use duplicated for that:
df[df.duplicated()]
You can specify the keep argument depending on what you want. From the docs:
keep : {'first', 'last', False}, default 'first'
- first : Mark duplicates as True except for the first occurrence.
- last : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
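For example, a minimal sketch (using a made-up DataFrame) of how the default keep='first' compares with keep=False, which flags every row in a duplicate group:

import pandas as pd

# Made-up example data with two duplicated rows
df = pd.DataFrame({'a': [1, 1, 2, 3, 3], 'b': ['x', 'x', 'y', 'z', 'z']})

# Default keep='first': the first occurrence is not flagged,
# so only the later copies (index 1 and 4) are returned
print(df[df.duplicated()])

# keep=False: every member of a duplicate group is returned
# (index 0, 1, 3 and 4)
print(df[df.duplicated(keep=False)])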
Upvotes: 10
Reputation: 4547
To identify duplicates within a pandas column without dropping them, try:
Let 'Column_A' be the column with duplicate entries and 'Column_B' a True/False column that marks the duplicates in Column_A.
df['Column_B'] = df.duplicated(subset='Column_A', keep='first')
Change the parameters to fine-tune it to your needs.
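For instance, a small sketch (the column names are just the placeholders from above) of what the marker column ends up looking like:

import pandas as pd

df = pd.DataFrame({'Column_A': ['apple', 'apple', 'pear']})
df['Column_B'] = df.duplicated(subset='Column_A', keep='first')
print(df)
#   Column_A  Column_B
# 0    apple     False
# 1    apple      True
# 2     pear     False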
Upvotes: 0