Remove non-duplicated entries

Question

A pandas DataFrame has two columns. I need to remove every row whose value in the first column has no duplicates.

Example data:

1 A
1 B
2 A
3 D
2 C
4 E
4 E

Expected output:

1 A
1 B
2 A
2 C
4 E
4 E

In other words, every value that occurs only once in the first column should be dropped, along with its row. What is the fastest way to do this in Python (~50k rows)?

Zero · Accepted Answer

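One way is to use the duplicated() method.

By default, df.duplicated('c1') flags every occurrence of a value except the first; keep='last' (spelled take_last=True in pandas before 0.17) flags every occurrence except the last. Combining the two masks with | selects every row whose c1 value appears more than once.

In [600]: df[df.duplicated('c1') | df.duplicated('c1', keep='last')]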
Out[600]:
   c1 c2
0   1  A
1   1  B
2   2  A
4   2  C
5   4  E
6   4  E
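
For reference, here is a self-contained sketch of the same filter that can be run as a script. It rebuilds the example frame from the question (the column names c1 and c2 are taken from the output above) and uses keep=False, which in pandas 0.17 and later flags all occurrences of a repeated value in a single call. The value_counts variant is only an alternative for comparison, not part of the original answer, and no timings are claimed for either.

```python
import pandas as pd

# Example data from the question; column names c1/c2 follow the answer's output.
df = pd.DataFrame({
    "c1": [1, 1, 2, 3, 2, 4, 4],
    "c2": ["A", "B", "A", "D", "C", "E", "E"],
})

# keep=False marks every row whose c1 value occurs more than once,
# so a single mask replaces the two masks combined with |.
result = df[df.duplicated("c1", keep=False)]

# Alternative: count how often each c1 value occurs, map the counts
# back onto the rows, and keep rows whose value occurs at least twice.
counts = df["c1"].map(df["c1"].value_counts())
alt = df[counts > 1]

print(result)
```

Both masks keep the original row order and index, so the row with c1 == 3 (index 3) is the only one dropped.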
