Reputation: 857
I am working on a large dataset and there are a few duplicates in my index. I'd like to (perhaps visually) check what these duplicated rows look like, and then decide which ones to drop. Is there a way to select the slice of the DataFrame that has duplicated indices (or duplicates in any column)?
Any help is appreciated.
Upvotes: 7
Views: 21854
Reputation: 2485
You can use pandas.DataFrame.duplicated and then slice the frame with the resulting boolean mask. For more information on any method or advanced feature, I'd advise you to check its docstring.
Well, this would solve the case for you:
df[df.duplicated('Column Name', keep=False)]
Here, keep=False marks all rows having duplicate values in that column, so the mask selects every one of them (comparing the mask == True is redundant).
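For instance, a minimal sketch (the column name, index labels, and values are assumptions for illustration):

import pandas as pd

# hypothetical frame with a repeated index label and repeated column values
df = pd.DataFrame({'Column Name': [1, 2, 2, 3]}, index=['a', 'b', 'b', 'c'])

# rows that share a value in 'Column Name' with at least one other row
print(df[df.duplicated('Column Name', keep=False)])

# rows whose index label occurs more than once (the duplicated-index
# case from the question), using Index.duplicated the same way
print(df[df.index.duplicated(keep=False)])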
Upvotes: 13
Reputation: 97281
use the duplicated method of DataFrame:
df.duplicated(subset=[...])
(in older pandas versions this argument was called cols)
See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
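By default, duplicated marks every occurrence after the first as True; a quick illustration on assumed sample data:

import pandas as pd

df = pd.DataFrame({'key': ['x', 'y', 'x', 'z']})  # assumed sample data

# boolean Series: True where the row repeats an earlier row in 'key';
# the first occurrence of each value stays False
print(df.duplicated(subset=['key']))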
EDIT
You can use:
df[df.duplicated(subset=[...]) | df.duplicated(subset=[...], keep='last')]
(take_last=True is the old spelling of keep='last'; in pandas 0.17+ a single call with keep=False does the same thing)
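For example, on assumed sample data, the union of the two masks and the single keep=False call select the same rows:

import pandas as pd

df = pd.DataFrame({'k': [1, 1, 2, 3, 3, 3]})  # assumed sample data

# the union of the two masks flags every member of every duplicated group
both = df[df.duplicated(subset=['k']) | df.duplicated(subset=['k'], keep='last')]

# single-call equivalent in pandas 0.17+
all_dups = df[df.duplicated(subset=['k'], keep=False)]

assert both.equals(all_dups)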
or, you can use groupby and filter:
df.groupby([...]).filter(lambda g: len(g) > 1)
or apply:
df.groupby([...], group_keys=False).apply(lambda g: g if len(g) > 1 else None)
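Both variants keep only the groups with more than one row; a sketch on assumed data:

import pandas as pd

df = pd.DataFrame({'k': [1, 1, 2], 'v': [10, 20, 30]})  # assumed data

# filter keeps the rows of every group whose size exceeds 1
print(df.groupby(['k']).filter(lambda g: len(g) > 1))

# apply returns each multi-row group unchanged; returning None drops a group
print(df.groupby(['k'], group_keys=False).apply(lambda g: g if len(g) > 1 else None))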
Upvotes: 6