Reputation: 857
I am working on a large dataset and there are a few duplicates in my index. I'd like to (perhaps visually) check what these duplicated rows look like, and then decide which ones to drop. Is there a way to select the slice of the DataFrame that has duplicated indices (or duplicates in any column)?
Any help is appreciated.
Upvotes: 7
Views: 21854
Reputation: 2485
You can use pandas.DataFrame.duplicated and then slice the frame with the resulting boolean mask. For more information on any method or advanced feature, I'd advise you to check its docstring.
Well, this would solve the case for you:
df[df.duplicated('Column Name', keep=False)]
Here, keep=False marks all rows having duplicate values in that column, so the mask selects every one of them (comparing the mask == True is redundant).
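For instance, a minimal sketch (the column name, index labels, and values are assumptions for illustration):

import pandas as pd

# hypothetical frame with a repeated index label and repeated column values
df = pd.DataFrame({'Column Name': [1, 2, 2, 3]}, index=['a', 'b', 'b', 'c'])

# rows that share a value in 'Column Name' with at least one other row
print(df[df.duplicated('Column Name', keep=False)])

# rows whose index label occurs more than once (the duplicated-index
# case from the question), using Index.duplicated the same way
print(df[df.index.duplicated(keep=False)])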
Upvotes: 13
Reputation: 97281
use the duplicated method of DataFrame:
df.duplicated(subset=[...])
(in older pandas versions this argument was called cols)
See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
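By default, duplicated marks every occurrence after the first as True; a quick illustration on assumed sample data:

import pandas as pd

df = pd.DataFrame({'key': ['x', 'y', 'x', 'z']})  # assumed sample data

# boolean Series: True where the row repeats an earlier row in 'key';
# the first occurrence of each value stays False
print(df.duplicated(subset=['key']))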
EDIT
You can use:
df[df.duplicated(subset=[...]) | df.duplicated(subset=[...], keep='last')]
(take_last=True is the old spelling of keep='last'; in pandas 0.17+ a single call with keep=False does the same thing)
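For example, on assumed sample data, the union of the two masks and the single keep=False call select the same rows:

import pandas as pd

df = pd.DataFrame({'k': [1, 1, 2, 3, 3, 3]})  # assumed sample data

# the union of the two masks flags every member of every duplicated group
both = df[df.duplicated(subset=['k']) | df.duplicated(subset=['k'], keep='last')]

# single-call equivalent in pandas 0.17+
all_dups = df[df.duplicated(subset=['k'], keep=False)]

assert both.equals(all_dups)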
or, you can use groupby and filter:
df.groupby([...]).filter(lambda g: len(g) > 1)
or apply:
df.groupby([...], group_keys=False).apply(lambda g: g if len(g) > 1 else None)
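Both variants keep only the groups with more than one row; a sketch on assumed data:

import pandas as pd

df = pd.DataFrame({'k': [1, 1, 2], 'v': [10, 20, 30]})  # assumed data

# filter keeps the rows of every group whose size exceeds 1
print(df.groupby(['k']).filter(lambda g: len(g) > 1))

# apply returns each multi-row group unchanged; returning None drops a group
print(df.groupby(['k'], group_keys=False).apply(lambda g: g if len(g) > 1 else None))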
Upvotes: 6