salhin

Reputation: 2654

Remove non-duplicated rows from a pandas DataFrame

This is rather simple, but I can't get my head around it. Given the following data frame, I want to keep only the rows whose value in column y appears more than once:

>>> df
    x   y
0   1   1
1   2   2
2   3   2
3   4   3
4   5   3
5   6   3
6   7   5
7   8   2

The desired output looks like:

>>> df
    x   y
1   2   2
2   3   2
3   4   3
4   5   3
5   6   3
7   8   2

I tried this:

df[~df.duplicated('y')]

but that returns the first occurrence of each y value instead:

    x   y
0   1   1
1   2   2
3   4   3
6   7   5

Upvotes: 8

Views: 10568

Answers (1)

Anton vBR

Reputation: 18906

Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Mark duplicates as True except for the first occurrence.

  • last : Mark duplicates as True except for the last occurrence.

  • False : Mark all duplicates as True.

With keep=False, every row whose y value occurs more than once is marked True, so you are looking for:

df[df.duplicated('y', keep=False)]

Output:

    x   y
1   2   2
2   3   2
3   4   3
4   5   3
5   6   3
7   8   2
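
For reference, here is a minimal self-contained sketch that rebuilds the frame from the question and compares the default keep='first' with keep=False. The variable names first_mask, all_mask and result are only illustrative, not part of the question:

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [1, 2, 2, 3, 3, 3, 5, 2],
})

# Default keep='first': the first occurrence of each y value is NOT marked,
# so negating the mask keeps exactly one row per distinct y value.
first_mask = df.duplicated("y")

# keep=False: every row whose y value occurs more than once is marked.
all_mask = df.duplicated("y", keep=False)

# Rows with a duplicated y value, original index preserved
result = df[all_mask]
print(result)
```

Negating first_mask is effectively what the attempt in the question does, which is why it returned one row per distinct y value rather than the duplicated rows.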

Upvotes: 19
