Reputation: 23800
Given the following Series:
sr = pd.Series([5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8])
I want to find the values that occur 3 times. This is my solution, which seems to work but looks very strange:
(sr.value_counts() == 3)[sr.value_counts() == 3].index.values
Is there any other/obvious way I am missing?
Upvotes: 3
Views: 242
Reputation: 4233
You could also use .where on the counts:
sr.value_counts().where(lambda x: x == 3).dropna().index
# Output:
Int64Index([8, 6, 5], dtype='int64')
Upvotes: 1
Reputation: 164783
Your logic is fine; you just shouldn't repeat the most expensive part, which is the counting. Store the counts in a variable and reuse them. You also may not need to retrieve the underlying NumPy array, since pd.Index objects are often sufficient:
sr = pd.Series([5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8])
counts = sr.value_counts()
res = counts[counts == 3].index
# Int64Index([8, 6, 5], dtype='int64')
The reason there's no ready-made method for what you want is that any solution requires at least O(n) time complexity, which is the complexity of value_counts itself. There's no way around this.
One alternative, the dict-based collections.Counter, is less efficient when it comes to filtering by count: since NumPy arrays are stored contiguously in memory, Boolean filtering on the counts is fast relative to iterating over a dictionary, as in the sketch below.
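For comparison, here is a minimal sketch of that Counter-based alternative (illustrative code, not part of the original answer); the filtering step is a plain Python loop over dictionary items rather than a vectorised Boolean mask:
from collections import Counter

import pandas as pd

sr = pd.Series([5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8])

# Count occurrences with a dict-based Counter (iterates over the Series values).
counts = Counter(sr)

# Filtering by count happens element by element in Python.
res = [value for value, count in counts.items() if count == 3]
# [5, 6, 8]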
Upvotes: 3
Reputation: 323356
Using loc:
sr.value_counts().loc[lambda x: x == 3].index
Out[162]: Int64Index([8, 6, 5], dtype='int64')
Upvotes: 2
Reputation: 51395
@jpp's answer is probably the one you should go with, but here is a weird alternative (just for fun):
sr.groupby(sr).filter(lambda x: len(x) == 3).unique()
# array([5, 6, 8])
Upvotes: 2