Reputation: 8357
I have a data frame in pandas and would like to get all the values of a certain column that appear more than X times. I know this should be easy but somehow I am not getting anywhere with my current attempts.
Here is an example:
>>> df2 = pd.DataFrame([{"uid": 0, "mi":1}, {"uid": 0, "mi":2}, {"uid": 0, "mi":1}, {"uid": 0, "mi":1}])
>>> df2
mi uid
0 1 0
1 2 0
2 1 0
3 1 0
Now supposed I want to get all values from column "mi" that appear more than 2 times, the result should be
>>> <fancy query>
array([1])
I have tried a couple of things with groupby and count but I always end up with a series with the values and their respective counts but don't know how to extract the values that have count more than X from that:
>>> df2.groupby('mi').mi.count() > 2
mi
1 True
2 False
dtype: bool
But how can I use this now to get the values of mi that are true?
Any hints appreciated :)
Upvotes: 42
Views: 74565
Reputation: 6378
Similar to @nicolaskruchten, slightly shorter version
df2.mi.value_counts().loc[lambda x: x>5].reset_index()['index']
And if you don't need to have the result within a serie, just do this:
df2.mi.value_counts().loc[lambda x: x>5].index
Upvotes: 21
Reputation: 1698
I found a problem with the solution provided by @juniper- If there are more than 2 values fulfilling your condition, they will not be printed out. For example:
>>> check=pd.DataFrame({'YOB':[1991,1992,1993,1991,1995,1994,1992,1991]})
>>>vc = check.YOB.value_counts()
>>>vc
1991 3
1992 2
1995 1
1994 1
1993 1
Name: YOB, dtype: int64
Let's say we want to find years which appear more than once:
>>>vc[vc>1]
1991 3
1992 2
Name: YOB, dtype: int64
If we now want to access the actual value, we need to do:
>>>vc[vc>1].index.tolist()
[1991,1992]
Rather than call it by index, which will print out the first value only:
>>>vc[vc>1].index[0]
1991
Upvotes: 9
Reputation: 7903
from collections import Counter
counts = Counter(df2.mi)
df2[df2.mi.isin([key for key in counts if counts[key] > 2])]
Upvotes: 1
Reputation: 27370
I use this:
df2.mi.value_counts().reset_index(name="count").query("count > 5")["index"]
The part before query()
gives me a data frame with two columns: index
and count
. The query()
filters on count
and then we pull out the values.
Upvotes: 18
Reputation: 6562
Or how about this:
Create the table:
>>> import pandas as pd
>>> df2 = pd.DataFrame([{"uid": 0, "mi":1}, {"uid": 0, "mi":2}, {"uid": 0, "mi":1}, {"uid": 0, "mi":1}])
Get the counts of each occurance:
>>> vc = df2.mi.value_counts()
>>> print vc
1 3
2 1
Print out those that occur more than 2 times:
>>> print vc[vc > 2].index[0]
1
Upvotes: 34