MarkD
MarkD

Reputation: 4954

Using pandas, how can I group/aggregate summing cases where boolean columns are true?

I have a DataFrame constructed from a database query. Each row in the frame has a database id, date, job, an issues boolean, and a fixed boolean. For example:

data = [
    {'id': 1, 'date': '2020-02-01', 'job': 'ABC', 'issue': True, 'fixed': False},
    {'id': 2, 'date': '2020-02-01', 'job': 'ABC', 'issue': False, 'fixed': False},
    {'id': 3, 'date': '2020-02-01', 'job': 'ABC', 'issue': True, 'fixed': True},
    {'id': 4, 'date': '2020-02-01', 'job': 'DEF', 'issue': True, 'fixed': True}
]
data_df = pd.DataFrame(data)

I want to do a groupby and agg where I am grouping by job and date, and getting the count of 'issues' and 'fixed' that are True. Something like:

result_data = [
    {'date': '2020-02-01', 'job': 'ABC', 'issue': 2, 'fixed': 1},
    {'date': '2020-02-01', 'job': 'DEF', 'issue': 1, 'fixed': 1}
]
result_df = pd.DataFrame(result_data)

The code would look something like:

result_df = data_df.groupby(['date', 'job']).agg({'issue': 'sum-true', 'fixed': 'sum-true'})

but I am not sure what 'sum-true' should be. Not, I cant just filter the whole DF by the column being true, and summing, as issue might be True, while fixed is False.

Upvotes: 1

Views: 1242

Answers (1)

boot-scootin
boot-scootin

Reputation: 12515

How about this?

>>> df.groupby(['date', 'job'])[['issue', 'fixed']].sum()
                issue  fixed
date       job              
2020-02-01 ABC    2.0    1.0
           DEF    1.0    1.0

Simply summing a boolean vector will return True counts.

And if you want the data in the exact format you specified above, just reset_index:

>>> df.groupby(['date', 'job'])[['issue', 'fixed']].sum().reset_index()
         date  job  issue  fixed
0  2020-02-01  ABC    2.0    1.0
1  2020-02-01  DEF    1.0    1.0

Upvotes: 1

Related Questions