Reputation: 4954
I have a DataFrame constructed from a database query. Each row in the frame has a database id, date, job, an issues boolean, and a fixed boolean. For example:
data = [
{'id': 1, 'date': '2020-02-01', 'job': 'ABC', 'issue': True, 'fixed': False},
{'id': 2, 'date': '2020-02-01', 'job': 'ABC', 'issue': False, 'fixed': False},
{'id': 3, 'date': '2020-02-01', 'job': 'ABC', 'issue': True, 'fixed': True},
{'id': 4, 'date': '2020-02-01', 'job': 'DEF', 'issue': True, 'fixed': True}
]
data_df = pd.DataFrame(data)
I want to do a groupby
and agg
where I am grouping by job and date, and getting the count of 'issues' and 'fixed' that are True. Something like:
result_data = [
{'date': '2020-02-01', 'job': 'ABC', 'issue': 2, 'fixed': 1},
{'date': '2020-02-01', 'job': 'DEF', 'issue': 1, 'fixed': 1}
]
result_df = pd.DataFrame(result_data)
The code would look something like:
result_df = data_df.groupby(['date', 'job']).agg({'issue': 'sum-true', 'fixed': 'sum-true'})
but I am not sure what 'sum-true'
should be. Not, I cant just filter the whole DF by the column being true, and summing, as issue
might be True
, while fixed
is False
.
Upvotes: 1
Views: 1242
Reputation: 12515
How about this?
>>> df.groupby(['date', 'job'])[['issue', 'fixed']].sum()
issue fixed
date job
2020-02-01 ABC 2.0 1.0
DEF 1.0 1.0
Simply summing a boolean vector will return True
counts.
And if you want the data in the exact format you specified above, just reset_index
:
>>> df.groupby(['date', 'job'])[['issue', 'fixed']].sum().reset_index()
date job issue fixed
0 2020-02-01 ABC 2.0 1.0
1 2020-02-01 DEF 1.0 1.0
Upvotes: 1