Flavio Rangel
Flavio Rangel

Reputation: 145

How can I use pandas agg to sum booleans and always obtain numbers as result?

I have a data frame with a bool type column. I would like to obtain the number of True values per id using pandas' groupby and agg functions. I've done this a bunch of times, but it seems the resulting column's type depends on the data frame. Here is an example:

import pandas as pd    
d = {'id': [1, 1, 2, 3], 'bool': [True, False, False, True]}
df = pd.DataFrame(data=d)

print(df.groupby(['id']).agg({'bool': 'sum'}))

The output I get from this code is:

   id   bool
0   1   True
1   2  False
2   3   True

Which is not what I want. Now, if agg tries to sum two True values:

import pandas as pd    
d = {'id': [1, 1, 2, 3], 'bool': [True, True, False, True]}
df = pd.DataFrame(data=d)

print(df.groupby(['id']).agg({'bool': 'sum'}))

Then I get:

   id                 bool
0   1                 2.00
1   2                 0.00
2   3                 1.00

Which is how I want.

I've seen situations in which a few rows are of type bool, whereas others are type float. It seems to be related with the number of rows grouped: if only one row, then it shows the bool value; if more than one, resulting type is float. I would like the resulting aggregated columns to always be of type float.

Pandas version is 1.0.1

Upvotes: 5

Views: 9111

Answers (4)

Nicolas Law-Dune
Nicolas Law-Dune

Reputation: 1703

You can just use the max function

df.groupby(['id']).agg({'bool': 'max'})

Upvotes: 3

Valdi_Bo
Valdi_Bo

Reputation: 31011

There is a dedicated Numpy function to count non-zero cells (True is counted as 1, False as 0). So you can run:

df.groupby(['id']).bool.agg(lambda gr: np.count_nonzero(gr))

I assume that you want integer number of True values. Otherwise append .astype(float).

Upvotes: 2

David Buck
David Buck

Reputation: 3844

You can sum and keep the output as a float in all cases with the following:

import pandas as pd    
d = {'id': [1, 1, 2, 3], 'bool': [True, False, False, True]}
df = pd.DataFrame(data=d)
print(df.groupby(['id'])['bool'].sum().astype(float))

Yields the output

id
1    1.0
2    0.0
3    1.0
Name: bool, dtype: float64

Upvotes: 3

super_zilla
super_zilla

Reputation: 51

You can use typecast to float. Use reset_index if you want a separate column for your index at the end

df.groupby(['id']).agg({'bool': 'sum'}).astype(float).reset_index()

Example:

>>> import pandas as pd    
>>> d = {'id': [1, 1, 2, 3], 'bool': [True, True, False, True]}
>>> df = pd.DataFrame(data=d)
>>> 
>>> df.groupby(['id']).agg({'bool': 'sum'}).astype(float).reset_index()
   id  bool
0   1   2.0
1   2   0.0
2   3   1.0
>>> 

Upvotes: 0

Related Questions