Select rows from dataframe with same values on several columns but different value on another

Question

I have a pandas dataframe with four feature columns and one label column. The dataset has a problem: some rows have the same feature values but are labelled differently. I know how to find rows that are duplicated across multiple columns using

df[df.duplicated(keep=False)]

How do I find duplicate features with conflicting labels though?

For example, in a dataframe like this:

    a    b    c    label
0   1    1    2     y
1   1    1    2     x
2   1    1    2     x
3   2    2    2     z
4   2    2    2     z

I want to output something like this:

a    b    c    label
1    1    2    y
1    1    2    x

Scott Boston · Accepted Answer

IIUC, try this:

df[df.groupby(['a','b','c'])['label'].transform('nunique') > 1]

Output:

   a  b  c label
0  1  1  2     y
1  1  1  2     x
2  1  1  2     x
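
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the example frame from the question and applies the same `groupby(...).transform('nunique')` filter. The names `features`, `n_labels`, `conflicts` and `distinct_conflicts` are just illustrative. The last step, `drop_duplicates()`, is an optional extra that is not part of the one-liner above; it assumes you want one row per distinct label, as in the question's desired output.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [1, 1, 1, 2, 2],
    "c": [2, 2, 2, 2, 2],
    "label": ["y", "x", "x", "z", "z"],
})

features = ["a", "b", "c"]  # illustrative name for the feature columns

# For each row, the number of distinct labels seen in its feature group.
# transform() broadcasts the per-group result back onto the original index.
# Note: nunique skips NaN labels by default.
n_labels = df.groupby(features)["label"].transform("nunique")

# Rows whose feature combination carries more than one label.
conflicts = df[n_labels > 1]

# Optional extra step: collapse repeated (features, label) rows.
distinct_conflicts = conflicts.drop_duplicates()

print(conflicts)
print(distinct_conflicts)
```

Without the `drop_duplicates()` step you get every row that belongs to a conflicting group (rows 0, 1 and 2 here); with it you get only rows 0 and 1.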
