Reputation: 88
I would like to use a Pandas groupby with multiple aggregation functions, but also apply a condition per aggregation. Imagine having this data as an example:
df = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b'],
    'type': ['in_scope', 'in_scope', 'exclude', 'in_scope', 'exclude'],
    'value': [5, 5, 99, 20, 99]
})
INPUT DATA:
| id | type     | value |
|----|----------|-------|
| a  | in_scope | 5     |
| a  | in_scope | 5     |
| a  | exclude  | 99    |
| b  | in_scope | 20    |
| b  | exclude  | 99    |
And I want to do a Pandas groupby like this:
df.groupby('id').agg(
    num_records=('id', 'size'),
    sum_value=('value', 'sum')
)
OUTPUT OF SIMPLE GROUPBY:
| id | num_records | sum_value |
|----|-------------|-----------|
| a | 3 | 109 |
| b | 2 | 119 |
However, I would like the sum to depend on a condition, namely that only the records whose type is in_scope should be included. Note that the first aggregation should still use the entire table. In short, this is the desired output:
DESIRED OUTPUT OF GROUPBY:
| id | num_records | sum_value_in_scope |
|----|-------------|--------------------|
| a | 3 | 10 |
| b | 2 | 20 |
I was thinking about passing two arguments to a lambda function, but I do not succeed. Of course, it can be solved by performing two separate groupbys on filtered and unfiltered data and combine them together afterwards. But I was hoping there was a shorter and more elegant way.
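For completeness, the two-groupby workaround described above might look like this (a sketch using the example data; the name `result` is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b'],
    'type': ['in_scope', 'in_scope', 'exclude', 'in_scope', 'exclude'],
    'value': [5, 5, 99, 20, 99]
})

# Aggregate the full table and the filtered table separately...
num_records = df.groupby('id').size().rename('num_records')
sum_in_scope = (
    df[df['type'] == 'in_scope']
    .groupby('id')['value'].sum()
    .rename('sum_value_in_scope')
)

# ...then combine the two results on the shared 'id' index.
result = pd.concat([num_records, sum_in_scope], axis=1)
```

It works, but it walks the data twice and needs the extra combine step, which is what the question hopes to avoid.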
Upvotes: 2
Views: 3931
Reputation: 28699
Updated answer: Create a temporary column that contains the value only for rows where type is in_scope, then aggregate:
# assumes: import numpy as np, import pandas as pd
(
    df.assign(temp=np.where(df["type"] == "in_scope", df["value"], None))
    .groupby("id", as_index=False)
    .agg(num_records=("type", "size"), sum_value=("temp", "sum"))
)
id num_records sum_value
a 3 10
b 2 20
Upvotes: 1
Reputation: 2791
Unfortunately, you cannot do this with aggregate; however, you can do it in one step with apply and a custom function:
def f(x):
    d = {}
    d['num_records'] = len(x)
    d['sum_value_in_scope'] = x.loc[x['type'] == 'in_scope', 'value'].sum()
    return pd.Series(d, index=['num_records', 'sum_value_in_scope'])

df.groupby('id').apply(f)
The comparison x['type'] == 'in_scope' builds a boolean mask, which directly filters the values that are summed. (If you already have a boolean column, you can use it as the mask as-is.) For more complex conditions, it is better to use df.query('<your query here>')
to get the subset of the data (there are optimizations under the hood which make it faster than most other methods).
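As an illustration of the df.query suggestion (a sketch using the asker's example data, where the condition lives in the string `type` column), the custom function could be written as:

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b'],
    'type': ['in_scope', 'in_scope', 'exclude', 'in_scope', 'exclude'],
    'value': [5, 5, 99, 20, 99]
})

def f(x):
    # query() filters each group's rows before the sum is taken.
    return pd.Series({
        'num_records': len(x),
        'sum_value_in_scope': x.query("type == 'in_scope'")['value'].sum(),
    })

result = df.groupby('id').apply(f)
```

The query string is evaluated against the group's columns, so the same function works unchanged for more involved conditions (e.g. combining several columns with `and`/`or`).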
Upvotes: 2