zen

Reputation: 93

Remove non-duplicated entries

A pandas DataFrame has two columns. I need to remove the rows whose value in the first column has no duplicates.

Example data:

1 A
1 B
2 A
3 D
2 C
4 E
4 E

Expected output

1 A
1 B
2 A
2 C
4 E
4 E

In other words, I need to remove every row whose first-column value occurs only once. What would be the fastest way to achieve this in Python (~50k rows)?

Upvotes: 8

Views: 8497

Answers (2)

Zero

Reputation: 76917

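One way is to use the duplicated() method.

By default, df.duplicated('c1') flags every occurrence of a value except the first, and keep='last' (spelled take_last=True in older pandas versions) flags every occurrence except the last. OR-ing the two masks selects every row whose c1 value appears more than once.

In [600]: df[df.duplicated('c1') | df.duplicated('c1', keep='last')]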
Out[600]:
   c1 c2
0   1  A
1   1  B
2   2  A
4   2  C
5   4  E
6   4  E
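
The same selection can probably be written with a single mask. This is a minimal sketch, assuming a pandas version in which duplicated() accepts the keep parameter; the frame is rebuilt from the question's example data and the column names c1/c2 are the ones used in this answer.

```python
import pandas as pd

# Example data from the question; column names c1/c2 follow this answer.
df = pd.DataFrame({
    "c1": [1, 1, 2, 3, 2, 4, 4],
    "c2": ["A", "B", "A", "D", "C", "E", "E"],
})

# keep=False marks every row whose c1 value occurs more than once,
# so rows with a single-occurring c1 value are the only ones left False.
mask = df.duplicated("c1", keep=False)
result = df[mask]
print(result)
```

Here the row with c1 == 3 is the only one dropped, and the original index labels are preserved.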

Upvotes: 6

ehaymore

Reputation: 217

Here's one way: Assume the dataframe is 'd' and the columns are named 'a' and 'b'. First, get the number of times each unique value in 'a' appears:

e = d['a'].value_counts()

Then select the values whose count is greater than 1, and return the rows whose first column is a member of that set:

d[d['a'].isin(e[e>1].index)]
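
Put together on the question's example data, the two steps might look like the sketch below. The frame contents come from the question; the names repeated and filtered are introduced here only for illustration.

```python
import pandas as pd

# Example data from the question, using this answer's names d, 'a' and 'b'.
d = pd.DataFrame({
    "a": [1, 1, 2, 3, 2, 4, 4],
    "b": ["A", "B", "A", "D", "C", "E", "E"],
})

# Step 1: count how many times each value of 'a' appears.
e = d["a"].value_counts()

# Step 2: keep the values that appear more than once ...
repeated = e[e > 1].index

# ... and select the rows whose 'a' value is among them.
filtered = d[d["a"].isin(repeated)]
print(filtered)
```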

Upvotes: 4
