Reputation: 171
Looking for something like
df.groupby('key').aggregate(combination(columnA, columnB))
instead of
df['combination'] = combination(columnA, columnB)
df.groupby('key')['combination'].aggregate()
The only requirement is that the combination of columns is calculated after the groupby.
It seems natural, logically, in some cases to first group and then compute the combination.
One example would be applying different aggregate functions to different combinations of columns over the same groups.
For example, given:
key   columnA  columnB
key1  1        1
key2  2        2
key1  3        3
key3  4        4
key2  5        5
#can do
df['combination'] = df.columnA * df.columnB
df.groupby('key').mean()
#looking for
grouped = df.groupby('key')
grouped.mean(columnA * columnB)
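For reference, the closest built-in route pandas offers today is GroupBy.apply, which does evaluate the combination after the groupby, inside each group. A minimal sketch against the sample frame above:
import pandas as pd

df = pd.DataFrame({
    "key": ["key1", "key2", "key1", "key3", "key2"],
    "columnA": [1, 2, 3, 4, 5],
    "columnB": [1, 2, 3, 4, 5],
})

grouped = df.groupby("key")
# the combination is computed per group, never stored as a column
print(grouped.apply(lambda g: (g.columnA * g.columnB).mean()))
# the same grouped object can be reused with a different combination/aggregation
print(grouped.apply(lambda g: (g.columnA + g.columnB).max()))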
Upvotes: 2
Views: 1122
Reputation: 13377
OK, so I think the answer you are looking for is: we don't do that, because of vectorization in Python. Consider the code below.
In essence, Python is often optimized to execute certain mathematical operations in a vectorized way (take numpy or pandas, for instance), which means applying an operation to the whole vector is faster than breaking it down into chunks and executing them one by one.
So e.g. df["A"].mul(df["B"])
will be faster than: df.apply(lambda X: X["A"]*X["B"], axis=0)
. Same goes for grouping - it's just way more scalable that way.
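A quick way to sanity-check that claim (a toy sketch; exact timings will vary by machine and pandas version):
import time
import numpy as np
import pandas as pd

# toy frame, just to show the vectorized-vs-apply gap
df = pd.DataFrame(np.random.randint(1, 9, (100000, 2)), columns=["A", "B"])

s = time.time()
vec = df["A"].mul(df["B"])  # vectorized: one call over whole columns
print(f"vectorized: {time.time() - s} s")

s = time.time()
looped = df.apply(lambda X: X["A"] * X["B"], axis=1)  # Python-level loop, row by row
print(f"apply: {time.time() - s} s")

assert (vec == looped).all()  # same result, very different runtime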
Try the code below; it's essentially what you were referring to, i.e. doing the operation before vs. after groupby(...). The vectorized solution scales up well even though you materialize an additional column: the more rows you process, the bigger the difference you will see.
Edit
I added a vectorized solution on the grouped data, so we have:
(1) we group first, then evaluate lazily, row by row, within each group
(2) we process the full df in a vectorized way, then group and apply a built-in aggregating function
(3) we group first, then process each group in a vectorized way and apply the aggregating function
In essence, the results show that breaking the work into chunks slows down the processing, regardless of whether it's per group or per record, so the vectorized solution scales better than any kind of custom solution we can apply on top.
import pandas as pd
import numpy as np
import time

# sample data: 3000 rows, 5 numeric columns plus a random grouping key
x = np.random.randint(1, 9, (3000, 5))
cols = [f"column{l}" for l in "ABCDE"]
df = pd.DataFrame(x, columns=cols)
df["cat"] = np.random.choice([f"key{l}" for l in "ABCDEFG"], size=3000)
df2 = df.copy()  # real copies, so method 2's extra column doesn't leak into df or df3
df3 = df.copy()
#print(df)

# method 1: group first, then compute the product lazily, row by row, per group
s = time.time()
# select only the numeric columns: the frame passed to apply also carries "cat"
df.groupby("cat").apply(lambda z: np.prod(z[cols].values, axis=1).mean()).pipe(print)
e = time.time() - s
print(f"method 1: {e} s")

# method 2: compute the product for the whole frame (vectorized), then group and aggregate
s = time.time()
df2["prod"] = np.prod(df2[cols], axis=1)
df2.groupby("cat")["prod"].mean().pipe(print)
e = time.time() - s
print(f"method 2: {e} s")

# method 3: group first, then compute the product in a vectorized way, group by group
s = time.time()
res3 = list(map(lambda g: (g[0], np.prod(g[1][cols], axis=1).mean()), df3.groupby("cat")))
print(res3)
e = time.time() - s
print(f"method 3: {e} s")
Upvotes: 2