rj dj
rj dj

Reputation: 290

Pandas - check if other columns have duplicates based on a different column

I have the following dataframe:

| col1 | col2 | col3 | col4 |
|------|------|------|------|
| a    | 1    | 2    | abc  |
| b    | 1    | 2    | abc  |
| c    | 3    | 2    | def  |

I want the rows which have duplicates based on col2, col3, col4 for unique values of col1.

In this case the output would be:

| col1 | col2 | col3 | col4 |
|------|------|------|------|
| a    | 1    | 2    | abc  |
| b    | 1    | 2    | abc  |

df.duplicated excluding col1 wont work since I need the col1 information to be contained in the result. I have millions of rows and further analysis would be difficult without this direct information. I can't set col1 as index as some other value needs to be set as index.

Is there a pythonic/pandaic way to achieve this?

Upvotes: 0

Views: 1439

Answers (3)

Quang Hoang
Quang Hoang

Reputation: 150815

We can use groupby:

df[df.groupby(['col2','col3','col4']).col1.transform(len) > 1]

Upvotes: 3

Ben Pap
Ben Pap

Reputation: 2579

df = pd.DataFrame({'col1': ['a','b','c'],
                  'col2':[1,1,3],
                  'col3': [2,2,2],
                  'col4':['abc','abc', 'def']})

df[df.duplicated(subset = ['col2', 'col3', 'col4'], keep = False)]

df
    col1    col2    col3    col4
0    a       1        2      abc
1    b       1        2      abc

df.duplicated looks for duplicate rows in your dataframe. Subset looks for the specific columns that you want to check, and keep = False displays both rows. If you just want to see one of the duplicated rows just delete that.

Upvotes: 2

BENY
BENY

Reputation: 323386

We can using filter

df.groupby(['col2','col3','col4']).filter(lambda x : (x['col1'].nunique()==x['col1'].count())&(x['col1'].nunique()>1))
Out[65]: 
  col1  col2  col3 col4
0    a     1     2  abc
1    b     1     2  abc

Also duplicated, first duplicate make sure you have duplicate value rows , second make sure you do not have only one row

df[df.duplicated(['col2','col3','col4'],keep=False)&~df.duplicated(['col1','col2','col3','col4'],keep=False)]
Out[70]: 
  col1  col2  col3 col4
0    a     1     2  abc
1    b     1     2  abc

Upvotes: 1

Related Questions