jaymzleutz

Reputation: 165

Using pandas value_counts() under defined condition

After a lot of errors, exceptions and high blood pressure, I finally came up with a solution that does what I need: counting the values in a column that satisfy a specific condition.

So, let's say I have a list of strings like

vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']

I want to count which values appear more than 2 times.

Assume the DataFrame built from this list has a single column named 'veh'.

So, this piece of code works:

df['veh'].value_counts()[df['veh'].value_counts() > 2]

The question is: why does the [df['veh'].value_counts() > 2] part come right after the "()" of value_counts()? There is no "." or any other linking symbol between them.

If I use the code

df['veh'].value_counts() > 2

(which is the syntax that seems logical to me), it only returns boolean values.

Can someone please help me understand the logic behind pandas?

I am pretty sure that pandas is awesome and the problem lies on my side, but I really want to understand it. I've read a lot of material (documentation included), but could not find anything that fills this gap.

Thank you in advance!

Upvotes: 2

Views: 5761

Answers (2)

Khanis Rok

Reputation: 637

The following line of code

df['veh'].value_counts()

returns a pandas Series whose index (the keys) holds the distinct values of the column and whose values are the number of occurrences of each.

Anything placed between square brackets [] selects entries of a pandas Series by key. So

df['veh'].value_counts()['car']

returns the number of occurrences of the word 'car' in column 'veh', that is, the value stored under the key 'car' in the Series df['veh'].value_counts().

A pandas Series also accepts a list of keys, so

df['veh'].value_counts()[['car','boat']]

returns the number of occurrences of the words 'car' and 'boat', respectively.
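As a minimal sketch of the two lookups above, using the list from the question (the names `counts`, `car_count` and `subset` are only illustrative and not part of the original code):

```python
import pandas as pd

# Data from the question, placed in a single-column DataFrame.
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

# value_counts() returns a Series: index = distinct values, values = counts.
counts = df['veh'].value_counts()

# A single key inside [] returns the value stored under that key.
car_count = counts['car']

# A list of keys inside [] returns a smaller Series with just those keys.
subset = counts[['car', 'boat']]

print(car_count)
print(subset)
```

Note that the brackets attach directly to the Series returned by value_counts(), exactly as they would to a variable holding that Series.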

Furthermore, a Series accepts a list of booleans as the key, provided it has the same length as the Series. That is, it accepts a boolean mask.

When you write

df['veh'].value_counts() > 2

you compare each value in df['veh'].value_counts() with the number 2. This returns one boolean per value, that is, a boolean mask.

So you can use the boolean mask as a filter on the series you created. Thus

df['veh'].value_counts()[df['veh'].value_counts() > 2]

returns the counts for only those keys whose count is greater than 2.
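The one-liner can be easier to read when split into steps. The following is a sketch on the question's data; the intermediate names `counts`, `mask` and `frequent` are introduced here for illustration only:

```python
import pandas as pd

# Data from the question, placed in a single-column DataFrame.
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

# Step 1: count occurrences of each distinct value.
counts = df['veh'].value_counts()

# Step 2: compare every count with 2; the result is a boolean Series
# with the same index as counts (the boolean mask).
mask = counts > 2

# Step 3: use the mask as the key; only rows where the mask is True are kept.
frequent = counts[mask]

print(mask)
print(frequent)
```

With this data only 'car' appears more than 2 times ('tank' appears exactly twice), so `frequent` holds a single entry.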

Upvotes: 3

Quang Hoang

Reputation: 150735

The logic is that you can index a Series with a boolean Series of the same size:

s[bool_series]

or equivalently

s.loc[bool_series]

This is also referred to as boolean indexing.

Now, your code is equivalent to:

s = df['veh'].value_counts()

bool_series = s > 2

followed by either of the first two forms above, e.g. s[s > 2].
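A quick way to convince yourself that these spellings agree is to compare them directly. This is only a sketch on the question's data; the names `via_brackets`, `via_loc` and `one_liner` are made up for the comparison:

```python
import pandas as pd

# Data from the question, placed in a single-column DataFrame.
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

s = df['veh'].value_counts()
bool_series = s > 2

# Plain brackets and .loc both accept the boolean Series.
via_brackets = s[bool_series]
via_loc = s.loc[bool_series]

# The original one-liner computes value_counts() twice but selects the same rows.
one_liner = df['veh'].value_counts()[df['veh'].value_counts() > 2]

print(via_brackets.equals(via_loc))
print(via_brackets.equals(one_liner))
```

Storing the counts in `s` first also avoids computing value_counts() twice.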

Upvotes: 1
