ddd

Reputation: 5029

Select rows from dataframe with same values on several columns but different value on another

I have a pandas dataframe with four feature columns and one label column. The dataset has a problem: some rows have identical feature values but different labels. I know how to find rows that are duplicated across all columns using

df[df.duplicated(keep=False)]

How do I find rows with duplicate features but conflicting labels, though?

For example, in a dataframe like this:

    a    b    c    label
0   1    1    2     y
1   1    1    2     x
2   1    1    2     x
3   2    2    2     z
4   2    2    2     z

I want the output to look something like this:

a    b    c    label
1    1    2    y
1    1    2    x

Upvotes: 1

Views: 3144

Answers (2)

Scott Boston

Reputation: 153460

IIUC, try this:

df[df.groupby(['a','b','c'])['label'].transform('nunique') > 1]

Output:

   a  b  c label
0  1  1  2     y
1  1  1  2     x
2  1  1  2     x
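
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the example dataframe from the question and applies the same groupby/transform filter. The names features, label_counts, conflicts and conflicts_unique are only illustrative, and the final drop_duplicates() step is an assumption about what the question wants (one row per distinct feature/label combination), not part of the answer above.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [1, 1, 1, 2, 2],
    "c": [2, 2, 2, 2, 2],
    "label": ["y", "x", "x", "z", "z"],
})

features = ["a", "b", "c"]  # illustrative name for the feature columns

# Count distinct labels per feature group and broadcast that count to every row.
label_counts = df.groupby(features)["label"].transform("nunique")

# Keep rows whose feature group carries more than one distinct label.
conflicts = df[label_counts > 1]

# Optional: collapse repeated (features, label) rows so each conflicting
# label appears once, which is closer to the output shown in the question.
conflicts_unique = conflicts.drop_duplicates()

print(conflicts)
print(conflicts_unique)
```

Note that nunique ignores missing values by default, so a group whose labels are, say, "x" and NaN would not be flagged by this filter.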

Upvotes: 5

Peritract

Reputation: 769

You can pass a list of columns to the subset parameter of .duplicated() to only consider those columns when checking for duplicates.

In your case, you would call df.duplicated(subset=["a", "b", "c"], keep=False).
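
A rough sketch of how that call could be extended to isolate the conflicting labels. On its own, duplicated(subset=...) returns a boolean mask that flags every repeated feature combination, whether or not the labels agree. One possible approach (an assumption, not something stated in this answer) is to first drop rows that repeat both features and label, so that any feature combination still duplicated afterwards must have conflicting labels. The names features, dup_mask, deduped and conflicts are only illustrative.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [1, 1, 1, 2, 2],
    "c": [2, 2, 2, 2, 2],
    "label": ["y", "x", "x", "z", "z"],
})

features = ["a", "b", "c"]  # illustrative name for the feature columns

# Boolean mask: True for every row whose feature values occur more than once.
# In this example every row is flagged, including the consistent "z" rows.
dup_mask = df.duplicated(subset=features, keep=False)

# Step 1: remove rows that repeat both the features and the label.
deduped = df.drop_duplicates(subset=features + ["label"])

# Step 2: feature combinations that are still duplicated have conflicting labels.
conflicts = deduped[deduped.duplicated(subset=features, keep=False)]

print(conflicts)
```

With the example data this keeps one "y" row and one "x" row for the features (1, 1, 2) and drops the consistent "z" rows.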

Upvotes: 0
