Robin
Robin

Reputation: 8357

Pandas: Get values from column that appear more than X times

I have a data frame in pandas and would like to get all the values of a certain column that appear more than X times. I know this should be easy but somehow I am not getting anywhere with my current attempts.

Here is an example:

>>> df2 = pd.DataFrame([{"uid": 0, "mi":1}, {"uid": 0, "mi":2}, {"uid": 0, "mi":1}, {"uid": 0, "mi":1}])
>>> df2

    mi  uid
0    1   0
1    2   0
2    1   0
3    1   0

Now supposed I want to get all values from column "mi" that appear more than 2 times, the result should be

>>> <fancy query>
array([1])

I have tried a couple of things with groupby and count but I always end up with a series with the values and their respective counts but don't know how to extract the values that have count more than X from that:

>>> df2.groupby('mi').mi.count() > 2
mi
1      True
2     False
dtype: bool

But how can I use this now to get the values of mi that are true?

Any hints appreciated :)

Upvotes: 42

Views: 74565

Answers (5)

mbh86
mbh86

Reputation: 6378

Similar to @nicolaskruchten, slightly shorter version

 df2.mi.value_counts().loc[lambda x: x>5].reset_index()['index']

And if you don't need to have the result within a serie, just do this:

df2.mi.value_counts().loc[lambda x: x>5].index

Upvotes: 21

branwen85
branwen85

Reputation: 1698

I found a problem with the solution provided by @juniper- If there are more than 2 values fulfilling your condition, they will not be printed out. For example:

>>> check=pd.DataFrame({'YOB':[1991,1992,1993,1991,1995,1994,1992,1991]})

>>>vc = check.YOB.value_counts()
>>>vc
1991    3
1992    2
1995    1
1994    1
1993    1
Name: YOB, dtype: int64

Let's say we want to find years which appear more than once:

>>>vc[vc>1]
1991    3
1992    2
Name: YOB, dtype: int64

If we now want to access the actual value, we need to do:

>>>vc[vc>1].index.tolist()
[1991,1992]

Rather than call it by index, which will print out the first value only:

>>>vc[vc>1].index[0]
1991

Upvotes: 9

A.Kot
A.Kot

Reputation: 7903

from collections import Counter

counts = Counter(df2.mi)
df2[df2.mi.isin([key for key in counts if counts[key] > 2])]

Upvotes: 1

nicolaskruchten
nicolaskruchten

Reputation: 27370

I use this:

 df2.mi.value_counts().reset_index(name="count").query("count > 5")["index"]

The part before query() gives me a data frame with two columns: index and count. The query() filters on count and then we pull out the values.

Upvotes: 18

juniper-
juniper-

Reputation: 6562

Or how about this:

Create the table:

>>> import pandas as pd
>>> df2 = pd.DataFrame([{"uid": 0, "mi":1}, {"uid": 0, "mi":2}, {"uid": 0, "mi":1}, {"uid": 0, "mi":1}])

Get the counts of each occurance:

>>> vc = df2.mi.value_counts()
>>> print vc
1    3
2    1

Print out those that occur more than 2 times:

>>> print vc[vc > 2].index[0]
1

Upvotes: 34

Related Questions