Reputation: 165
After a lot of errors, exceptions and high blood pressure, I finally came up with a solution that does what I need: basically, I need to count the column values that satisfy a specific condition.
So, let's say I have a list of strings like
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
I want to find which values appear more than 2 times.
Assume the DataFrame built from this list has a column named 'veh'.
So, this piece of code works:
df['veh'].value_counts()[df['veh'].value_counts() > 2]
The question is: why does the [df['veh'].value_counts() > 2] part come right after the "()" of value_counts()? There is no "." or any other linking symbol that could mean something.
If I use the code
df['veh'].value_counts() > 2
(which is the logical syntax that my limited brain can grasp), it returns boolean values.
Can someone please help me understand the logic behind pandas?
I am pretty sure that pandas is awesome and the problem lies on my side, but I really want to understand it. I've read a lot of material (documentation included), but could not fill this gap.
Thank you in advance!
Upvotes: 2
Views: 5761
Reputation: 637
The following line of code
df['veh'].value_counts()
returns a pandas Series whose index holds the distinct values of the column (the keys) and whose values are the number of occurrences of each.
Anything placed in square brackets [] after a Series selects entries of that Series by key. So
df['veh'].value_counts()['car']
should return the number of occurrences of the word 'car' in column 'veh', that is, the value stored under the key 'car' in the Series df['veh'].value_counts().
A pandas Series also accepts a list of keys, so
df['veh'].value_counts()[['car','boat']]
should return the number of occurrences of the words 'car' and 'boat', respectively.
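The two lookups above can be sketched as follows. This is only an illustration: it assumes the DataFrame is built directly from the list in the question, and the names `counts`, `car_count` and `subset` are introduced here for readability.

```python
import pandas as pd

# Assumption: the DataFrame is built from the question's list, column 'veh'.
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

# value_counts() returns a Series: index = distinct values, values = counts.
counts = df['veh'].value_counts()

# A single key inside [] returns the value stored under that key.
car_count = counts['car']            # 3

# A list of keys inside [] returns a smaller Series with just those keys.
subset = counts[['car', 'boat']]     # car -> 3, boat -> 1

print(car_count)
print(subset)
```

In both cases the square brackets are ordinary Python indexing applied to the Series that value_counts() returned, which is why no "." is needed in between.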
Furthermore, a Series accepts a list of booleans as the key, provided it has the same length as the Series. That is, it accepts a boolean mask.
When you write
df['veh'].value_counts() > 2
you compare each value in df['veh'].value_counts()
with the number 2. This returns one boolean per value, that is, a boolean mask.
So you can use the boolean mask as a filter on the series you created. Thus
df['veh'].value_counts()[df['veh'].value_counts() > 2]
returns the counts for the keys that occur more than 2 times.
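To make the mask visible, here is a rough step-by-step version of that one-liner. It assumes the same DataFrame as in the question; `counts`, `mask` and `frequent` are names chosen here, not anything pandas requires.

```python
import pandas as pd

# Assumption: the DataFrame is built from the question's list, column 'veh'.
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

counts = df['veh'].value_counts()   # one count per distinct value

# Comparing a Series with a number compares element by element,
# so the result is a boolean Series with the same index as counts.
mask = counts > 2

# Passing that boolean Series inside [] keeps only the rows where it is True.
frequent = counts[mask]

print(mask)
print(frequent)   # only 'car' (3 occurrences) passes the > 2 test
```

Storing the counts in a variable also avoids calling value_counts() twice, which the one-liner does.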
Upvotes: 3
Reputation: 150735
The logic is that you can slice a series with a boolean series of the same size:
s[bool_series]
or equivalently
s.loc[bool_series]
This is also referred to as boolean indexing.
Now, your code is equivalent to:
s = df['veh'].value_counts()
bool_series = s > 2
followed by either of the two forms above, e.g. s[s > 2]
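A quick way to convince yourself that these forms are interchangeable is to compare their results. The sketch below assumes the DataFrame from the question; the variable names other than `s` and `bool_series` are made up for the comparison.

```python
import pandas as pd

# Assumption: the DataFrame is built from the question's list, column 'veh'.
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

s = df['veh'].value_counts()
bool_series = s > 2

via_brackets = s[bool_series]        # plain [] with a boolean Series
via_loc = s.loc[bool_series]         # the same selection through .loc
short_form = s[s > 2]                # mask built inline

# The one-liner from the question, for comparison.
original = df['veh'].value_counts()[df['veh'].value_counts() > 2]

print(via_brackets.equals(via_loc))
print(via_brackets.equals(short_form))
print(via_brackets.equals(original))
```

All three comparisons should print True, since each expression applies the same boolean mask to the same Series.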
Upvotes: 1