Reputation: 290
I have the following dataframe:
| col1 | col2 | col3 | col4 |
|------|------|------|------|
| a | 1 | 2 | abc |
| b | 1 | 2 | abc |
| c | 3 | 2 | def |
I want the rows that are duplicates based on col2, col3 and col4 but come from different values of col1.
In this case the output would be:
| col1 | col2 | col3 | col4 |
|------|------|------|------|
| a | 1 | 2 | abc |
| b | 1 | 2 | abc |
Using df.duplicated while excluding col1 won't work, since I need the col1 information to be contained in the result. I have millions of rows, and further analysis would be difficult without this information directly available. I also can't set col1 as the index, because another column needs to be the index.
Is there a Pythonic / pandas-idiomatic way to achieve this?
Upvotes: 0
Views: 1439
Reputation: 150815
We can use groupby:
df[df.groupby(['col2','col3','col4']).col1.transform(len) > 1]
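On the sample data this keeps rows a and b. As a minimal sketch assuming the dataframe from the question, transform('size') gives the same per-group row count without calling len on each group:
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': [1, 1, 3],
                   'col3': [2, 2, 2],
                   'col4': ['abc', 'abc', 'def']})

# keep only rows whose (col2, col3, col4) combination appears more than once
df[df.groupby(['col2', 'col3', 'col4']).col1.transform('size') > 1]
#   col1  col2  col3  col4
# 0    a     1     2   abc
# 1    b     1     2   abc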
Upvotes: 3
Reputation: 2579
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': [1, 1, 3],
                   'col3': [2, 2, 2],
                   'col4': ['abc', 'abc', 'def']})
df[df.duplicated(subset=['col2', 'col3', 'col4'], keep=False)]

  col1  col2  col3  col4
0    a     1     2   abc
1    b     1     2   abc
df.duplicated flags duplicate rows in your dataframe. subset restricts the check to the specific columns you want to compare, and keep=False marks every occurrence of a duplicate, which is why both rows are returned. If you just want to see one of the duplicated rows, drop keep=False and the default keep='first' will mark only the later occurrences.
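As a small illustration of that last point (my own sketch on the sample data, not part of the original answer):
# default keep='first' flags every occurrence except the first
df[df.duplicated(subset=['col2', 'col3', 'col4'])]
#   col1  col2  col3  col4
# 1    b     1     2   abc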
Upvotes: 2
Reputation: 323386
We can use filter:
df.groupby(['col2', 'col3', 'col4']).filter(lambda x: (x['col1'].nunique() == x['col1'].count()) & (x['col1'].nunique() > 1))
Out[65]:
  col1  col2  col3  col4
0    a     1     2   abc
1    b     1     2   abc
Also with duplicated: the first duplicated call makes sure a row has duplicate values across col2, col3 and col4, and the second excludes rows that are duplicated on all four columns, so only duplicates coming from distinct col1 values remain.
df[df.duplicated(['col2', 'col3', 'col4'], keep=False) & ~df.duplicated(['col1', 'col2', 'col3', 'col4'], keep=False)]
Out[70]:
  col1  col2  col3  col4
0    a     1     2   abc
1    b     1     2   abc
Upvotes: 1