giannisl9

Reputation: 171

Combination of columns for aggregation after groupby

Question

Looking for something like

df.groupby('key').aggregate(combination(columnA, columnB))

instead of

df['combination'] = combination(columnA, columnB)
df.groupby('key')['combination'].aggregate()

The only requirement is that the combination of columns is calculated after the groupby.


Description

It seems natural, logically, in some cases to first group by and then aggregate.

One example would be different aggregate functions for different combinations of columns that use the same groups.

Looking for

  1. groupby
  2. choose combination of columns
  3. use the corresponding aggregate function

instead of

  1. create all the necessary columns (for every aggregate function)
  2. groupby (for every aggregate function)
  3. apply specific aggregate function

Example

key     ColumnA  ColumnB
key1       1        1
key2       2        2
key1       3        3
key3       4        4
key2       5        5

#can do
df['combination'] = df.ColumnA * df.ColumnB
df.groupby('key')['combination'].mean()

#looking for
grouped = df.groupby('key')
grouped.mean(ColumnA * ColumnB)
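
For concreteness, a runnable sketch of the example above - groupby().apply() with a lambda is the closest existing spelling I'm aware of for computing the combination after the groupby (and the same grouped object can be reused for different combinations and aggregate functions), though it runs Python code per group:

import pandas as pd

df = pd.DataFrame({
    "key":     ["key1", "key2", "key1", "key3", "key2"],
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": [1, 2, 3, 4, 5],
})

# can do: materialize the combination first, then group and aggregate
df["combination"] = df.ColumnA * df.ColumnB
print(df.groupby("key")["combination"].mean())

# closest to what I'm looking for: the combination is computed after the groupby,
# and one grouped object serves several combination/aggregate pairs
grouped = df.groupby("key")
print(grouped.apply(lambda g: (g.ColumnA * g.ColumnB).mean()))
print(grouped.apply(lambda g: (g.ColumnA + g.ColumnB).max()))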

Upvotes: 2

Views: 1122

Answers (1)

Georgina Skibinski

Reputation: 13377

Ok, so I think the answer you are looking for is - we don't do that, because of vectorization in Python.

Consider the below code.

Now in essence - Python libraries are often optimized to execute certain mathematical operations in a vectorized way (take NumPy or pandas for instance) - which means applying an operation to the whole vector is faster than breaking it down into chunks and executing them one by one.

So e.g. df["A"].mul(df["B"]) will be faster than df.apply(lambda X: X["A"]*X["B"], axis=1). The same goes for grouping - it's just way more scalable that way.
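
As a rough illustration of that claim, here is a minimal sketch (the size and column names are arbitrary, not taken from the question):

import pandas as pd
import numpy as np
import time

d = pd.DataFrame(np.random.randint(1, 9, (100000, 2)), columns=["A", "B"])

s = time.time()
vec = d["A"].mul(d["B"])                              # one vectorized call over whole columns
print(f"vectorized mul: {time.time() - s} s")

s = time.time()
rowwise = d.apply(lambda X: X["A"] * X["B"], axis=1)  # one Python-level call per row
print(f"row-wise apply: {time.time() - s} s")

assert (vec == rowwise).all()  # identical values, very different runtimes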

Try the benchmark code below - it's in essence what you were referring to - doing the operation before vs after groupby(...). The vectorized solution scales up pretty well even though you materialize an additional column - the more rows you process, the bigger the difference you will see.

Edit

I added a vectorized solution on the grouped data, so we have:

(1) we group, then compute the combination group by group inside apply

(2) we process the full df in a vectorized way, then group and apply a built-in aggregating function

(3) we group, then iterate over the groups manually and compute the combination per group

In essence - from the results we see that breaking the work down into chunks slows down the processing, regardless of whether it's per group or per record - so the vectorized solution scales better than any kind of custom solution we can apply on top.

import pandas as pd
import numpy as np
import time

# random test frame: five integer columns plus a grouping key "cat"
cols = [f"column{l}" for l in list("ABCDE")]
x = np.random.randint(1, 9, (3000, 5))
df = pd.DataFrame(x, columns=cols)
df["cat"] = np.random.choice([f"key{l}" for l in list("ABCDEFG")], size=3000)
df2 = df.copy()
df3 = df.copy()

# method 1: group first, then compute the row products inside apply, group by group
s = time.time()
df.groupby("cat").apply(lambda z: np.prod(z[cols].values, axis=1).mean()).pipe(print)
e = time.time() - s
print(f"method 1: {e} s")

# method 2: vectorized product over the whole frame, then a built-in groupby mean
s = time.time()
df2["prod"] = np.prod(df2[cols], axis=1)
df2.groupby("cat")["prod"].mean().pipe(print)
e = time.time() - s
print(f"method 2: {e} s")

# method 3: iterate over the groups manually and compute the product per group
s = time.time()
res3 = list(map(lambda g: (g[0], np.prod(g[1][cols], axis=1).mean()), df3.groupby("cat")))
print(res3)
e = time.time() - s
print(f"method 3: {e} s")

Upvotes: 2
