Reputation: 5029
I have a pandas DataFrame with four feature columns and one label column. The dataset has a problem: some rows have identical feature values but different labels. I know how to find duplicates across multiple columns using
df[df.duplicated(keep=False)]
How do I find duplicate features with conflicting labels though?
For example, given a DataFrame like this:
a b c label
0 1 1 2 y
1 1 1 2 x
2 1 1 2 x
3 2 2 2 z
4 2 2 2 z
I want output like the following:
a b c label
1 1 2 y
1 1 2 x
Upvotes: 1
Views: 3144
Reputation: 153460
IIUC, try this:
df[df.groupby(['a','b','c'])['label'].transform('nunique') > 1]
Output:
a b c label
0 1 1 2 y
1 1 1 2 x
2 1 1 2 x
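For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the example frame from the question. The names `features`, `n_labels`, `conflicts`, and `distinct_conflicts` are only illustrative. The final `drop_duplicates()` step is an optional extra, assuming you want one row per distinct feature/label combination as in the question's desired output.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [1, 1, 1, 2, 2],
    "c": [2, 2, 2, 2, 2],
    "label": ["y", "x", "x", "z", "z"],
})

features = ["a", "b", "c"]  # illustrative name for the feature columns

# For every row, count the distinct labels within its feature group.
n_labels = df.groupby(features)["label"].transform("nunique")

# Keep rows whose feature group carries more than one distinct label.
conflicts = df[n_labels > 1]

# Optional: collapse repeated (features, label) rows to one row each.
distinct_conflicts = conflicts.drop_duplicates()

print(conflicts)
print(distinct_conflicts)
```

`transform` returns a Series aligned with the original index, which is why it can be used directly as a boolean mask on `df`.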
Upvotes: 5
Reputation: 769
You can pass a list of columns to the subset parameter of .duplicated() to only consider those columns when checking for duplicates. In your case, you would call df.duplicated(subset=["a", "b", "c"], keep=False).
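Note that on its own this mask flags every row whose features repeat, including groups whose labels agree (rows 3 and 4 in the question's example). One possible way to narrow it to conflicting labels, sketched here as an assumption rather than the only approach, is to drop fully identical rows first and then look for repeated features among what remains. The names `features`, `dup_mask`, `deduped`, and `conflicting` are illustrative.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [1, 1, 1, 2, 2],
    "c": [2, 2, 2, 2, 2],
    "label": ["y", "x", "x", "z", "z"],
})

features = ["a", "b", "c"]  # illustrative name for the feature columns

# Flags every row whose feature values appear more than once,
# regardless of whether the labels agree.
dup_mask = df.duplicated(subset=features, keep=False)

# Drop rows that are identical in features AND label, so each
# (features, label) combination appears once.
deduped = df.drop_duplicates()

# Any feature combination still repeated must have differing labels.
conflicting = deduped[deduped.duplicated(subset=features, keep=False)]

print(conflicting)
```

This keeps one row per conflicting label, which matches the shape of the output requested in the question.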
Upvotes: 0