Reputation: 93
A pandas DataFrame has two columns. I need to remove every row whose value in the first column has no duplicates.
Example data:
1 A
1 B
2 A
3 D
2 C
4 E
4 E
Expected output
1 A
1 B
2 A
2 C
4 E
4 E
In other words, I need to remove all rows whose first-column value occurs only once (i.e. is unique). What would be the fastest way to achieve this in Python (~50k rows)?
Upvotes: 8
Views: 8497
Reputation: 76917
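One way is to use the duplicated() method. By default,
df.duplicated('c1')
flags every occurrence except the first, while keep='last' (spelled take_last=True in pandas versions before 0.17) flags every occurrence except the last.
Combining the two masks with | selects every row whose c1 value occurs more than once:
In [600]: df[df.duplicated('c1') | df.duplicated('c1', keep='last')]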
Out[600]:
c1 c2
0 1 A
1 1 B
2 2 A
4 2 C
5 4 E
6 4 E
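In pandas 0.17 and later, the same selection can probably be written with a single mask, since duplicated() accepts keep=False to flag every occurrence of a repeated value. The following is a minimal sketch under that assumption; the frame is rebuilt from the question's sample data, and the column names c1 and c2 are taken from the output above.

```python
import pandas as pd

# Sample data from the question; column names c1/c2 follow the output above.
df = pd.DataFrame({
    "c1": [1, 1, 2, 3, 2, 4, 4],
    "c2": ["A", "B", "A", "D", "C", "E", "E"],
})

# keep=False marks every row whose c1 value occurs more than once,
# so no second mask is needed.
mask = df.duplicated("c1", keep=False)
result = df[mask]
print(result)
```

Row 3 (c1 == 3) is the only one dropped, because 3 is the only value that occurs once.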
Upvotes: 6
Reputation: 217
Here's one way: Assume the dataframe is 'd' and the columns are named 'a' and 'b'. First, get the number of times each unique value in 'a' appears:
e = d['a'].value_counts()
Then select the values whose count is greater than 1, and keep the rows whose first column is one of those values:
d[d['a'].isin(e[e>1].index)]
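Put together as a runnable sketch, the two steps look roughly like this. The data is the question's sample; the names repeated, filtered, sizes and filtered_alt are introduced here for illustration only. A groupby/transform variant is included as an alternative of my own, not part of the answer above; which of the two is faster on ~50k rows is not measured here.

```python
import pandas as pd

# Sample data from the question, using the column names 'a' and 'b' assumed above.
d = pd.DataFrame({
    "a": [1, 1, 2, 3, 2, 4, 4],
    "b": ["A", "B", "A", "D", "C", "E", "E"],
})

# Step 1: count how many times each value of 'a' appears.
e = d["a"].value_counts()

# Step 2: keep the rows whose 'a' value has a count greater than 1.
repeated = e[e > 1].index
filtered = d[d["a"].isin(repeated)]

# Alternative (not from the answer): compute each row's group size directly.
sizes = d.groupby("a")["a"].transform("size")
filtered_alt = d[sizes > 1]

print(filtered)
```

Both variants preserve the original row order and index.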
Upvotes: 4