Ira
Ira

Reputation: 567

Pandas arrange rows based on duplicate values of selected column

I have a dataframe that looks like -

ID  NAME    AGE
1   ABC     10
2   XYZ     12
3   PQR     20
4   ABC     25
5   XYZ     30
6   PQR     20
7   KLM     22
8   NOP     16

I would like to re-arrange and keep only the following rows based on the duplicate values on NAME column -

ID  NAME    AGE
1   ABC     10
2   ABC     25
3   XYZ     12
4   XYZ     30
5   PQR     20
6   PQR     20

with following rows removed -

7   KLM     22
8   NOP     16

Thanks!

Upvotes: 1

Views: 113

Answers (2)

sharathnatraj
sharathnatraj

Reputation: 1614

While the most simplest way is the df.duplicated solution, here is another take on this using groupby and transform

df[(df.groupby('NAME')['AGE'].transform('size')>1)].sort_values('NAME')

Tried performance comparison with df.duplicated for curiosity and it about the same for both.

df=pd.concat([df]*1000000)

%timeit df1 = df[(df.groupby('NAME')['AGE'].transform('size')>1)].sort_values('NAME')
15.8 s ± 475 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df1 = df[df.duplicated(subset=['NAME'], keep=False)].sort_values(by='NAME')
15.9 s ± 632 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 0

wwnde
wwnde

Reputation: 26676

Please try

df[df.duplicated(subset=['NAME'], keep=False)].sort_values(by='NAME')

Upvotes: 2

Related Questions