gc5
gc5

Reputation: 9898

Pandas GroupBy using another DataFrame of one-hot encodings/overlapping masks

I have two dataframes with observations on rows and features (or group membership) on columns, e.g.:

> data_df

     a    b    c
A    1    2    1
B    0    1    3
C    0    0    1
D    2    1    1
E    1    1    1

> mask_df

    g1   g2
A    0    1
B    1    0
C    1    0
D    1    0
E    0    1

I want to group and aggregate (by sum) the values in the first dataframe (data_df) conditional on the binary values (masks) in the second dataframe (mask_df). The result should be the following (groups x features):

> aggr_df

     a    b    c
g1   2    2    5
g2   2    3    2

Is there a way in pandas to group the first dataframe (data_df) using the masks contained in a second dataframe (mask_df) in a single command?

Upvotes: 4

Views: 911

Answers (5)

gc5
gc5

Reputation: 9898

I decided to write another answer since:

  • coldspeed's answer works only with one-hot encodings
  • W-B's answer cannot be easily parallelized since it runs on dict comprehension

In my case I noticed that I could achieve the same result just by using a dot product of mask_df with data_df:

> mask_df.T.dot(data_df)

In the special case of getting the average instead of the sum, this is achievable scaling the mask_df by the number of ones for each group:

> mask_df.T.dot(data_df).div(mask_df.sum(), axis=0)

Upvotes: 2

BENY
BENY

Reputation: 323356

Notice that this will work even in the case that observations in the first dataframe (data_df) belong to multiple masks in the second dataframe (mask_df).

> pd.concat({x:data_df.mul(mask_df[x],0).sum() for x in mask_df}).unstack()

    a  b  c
g1  2  2  5
g2  2  3  2

Upvotes: 3

cs95
cs95

Reputation: 402922

You can do this cheaply with dot and groupby:

data_df.groupby(mask_df.dot(mask_df.columns)).sum()

    a  b  c
g1  2  2  5
g2  2  3  2

Where,

mask_df.dot(mask_df.columns)

A    g2
B    g1
C    g1
D    g1
E    g2
dtype: object

Which works well assuming each row always has exactly one column set to 1.

Upvotes: 3

yatu
yatu

Reputation: 88285

Here's a way using a list comprehension:

pd.DataFrame([(data_df.T * mask_df[i]).sum(axis=1) for i in mask_df.columns], 
             index = mask.columns)

    a  b  c
g1  2  2  5
g2  2  3  2

Upvotes: 1

Polkaguy6000
Polkaguy6000

Reputation: 1208

The best way to do this is to combine the dataframes. You can combine on the index by using a join statement first. df_merge = data_df.merge(aggr_df, left_on=True, right_on=True). Then you can just use df_merge for your grouping operations.

Upvotes: 2

Related Questions