Reputation: 9898
I have two dataframes with observations on rows and features (or group membership) on columns, e.g.:
> data_df
a b c
A 1 2 1
B 0 1 3
C 0 0 1
D 2 1 1
E 1 1 1
> mask_df
g1 g2
A 0 1
B 1 0
C 1 0
D 1 0
E 0 1
I want to group and aggregate (by sum) the values in the first dataframe (data_df
) conditional on the binary values (masks) in the second dataframe (mask_df
). The result should be the following (groups x features):
> aggr_df
a b c
g1 2 2 5
g2 2 3 2
Is there a way in pandas to group the first dataframe (data_df
) using the masks contained in a second dataframe (mask_df
) in a single command?
Upvotes: 4
Views: 911
Reputation: 9898
I decided to write another answer since:
In my case I noticed that I could achieve the same result just by using a dot product of mask_df
with data_df
:
> mask_df.T.dot(data_df)
In the special case of getting the average instead of the sum, this is achievable scaling the mask_df
by the number of ones for each group:
> mask_df.T.dot(data_df).div(mask_df.sum(), axis=0)
Upvotes: 2
Reputation: 323356
Notice that this will work even in the case that observations in the first dataframe (data_df
) belong to multiple masks in the second dataframe (mask_df
).
> pd.concat({x:data_df.mul(mask_df[x],0).sum() for x in mask_df}).unstack()
a b c
g1 2 2 5
g2 2 3 2
Upvotes: 3
Reputation: 402922
You can do this cheaply with dot
and groupby
:
data_df.groupby(mask_df.dot(mask_df.columns)).sum()
a b c
g1 2 2 5
g2 2 3 2
Where,
mask_df.dot(mask_df.columns)
A g2
B g1
C g1
D g1
E g2
dtype: object
Which works well assuming each row always has exactly one column set to 1
.
Upvotes: 3
Reputation: 88285
Here's a way using a list comprehension:
pd.DataFrame([(data_df.T * mask_df[i]).sum(axis=1) for i in mask_df.columns],
index = mask.columns)
a b c
g1 2 2 5
g2 2 3 2
Upvotes: 1
Reputation: 1208
The best way to do this is to combine the dataframes. You can combine on the index by using a join statement first. df_merge = data_df.merge(aggr_df, left_on=True, right_on=True)
. Then you can just use df_merge
for your grouping operations.
Upvotes: 2