Reputation: 2654
This is rather simple, but I can't get my head around it. Given the following data frame, I want to keep only the rows whose value in column y occurs more than once:
>>> df
x y
0 1 1
1 2 2
2 3 2
3 4 3
4 5 3
5 6 3
6 7 5
7 8 2
The desired output looks like:
>>> df
x y
1 2 2
2 3 2
3 4 3
4 5 3
5 6 3
7 8 2
I tried this:
df[~df.duplicated('y')]
but that keeps only the first occurrence of each y value:
x y
0 1 1
1 2 2
3 4 3
6 7 5
Upvotes: 8
Views: 10568
Reputation: 18906
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
keep : {‘first’, ‘last’, False}, default ‘first’
first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
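With the default keep='first', the first occurrence of each value is not marked, and negating the mask with ~ keeps exactly those first occurrences. To mark every row whose y value is repeated, pass keep=False and drop the negation:
df[df.duplicated('y', keep=False)]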
Output:
x y
1 2 2
2 3 2
3 4 3
4 5 3
5 6 3
7 8 2
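For anyone who wants to compare the three keep settings side by side, here is a minimal sketch. It rebuilds the frame from the question by hand (the constructor call is my assumption about how df was created), and the variable names first_mask, last_mask and all_mask are only illustrative.

```python
import pandas as pd

# Frame from the question, rebuilt by hand (assumed construction).
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8],
                   "y": [1, 2, 2, 3, 3, 3, 5, 2]})

# keep='first' (default): every repeat except the first occurrence is True.
first_mask = df.duplicated("y")

# keep='last': every repeat except the last occurrence is True.
last_mask = df.duplicated("y", keep="last")

# keep=False: every row whose y value appears more than once is True.
all_mask = df.duplicated("y", keep=False)

# Rows with a repeated y value, original index preserved.
result = df[all_mask]
print(result)

# The attempt from the question: negating the default mask keeps
# unique values plus the first occurrence of each repeated value.
attempt = df[~first_mask]
print(attempt)
```

Here result should match the desired output (index 1, 2, 3, 4, 5, 7), while attempt reproduces the rows the question got (index 0, 1, 3, 6).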
Upvotes: 19