Reputation: 496
I am having issues using the groupby_object.agg() method with functions where I want to change the input parameters. Is there a resource available of function names .agg() accepts, and how to pass parameters to them?
See an example below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'numbers': [1, 2, 3, 2, 1, 3],
                   'colors': ['red', 'white', 'blue', 'red', 'white', np.nan],
                   'weight': [10, 10, 20, 5, 10, 20]})
df['colors'].nunique() # Returns 3 as NaN is not counted
df['colors'].nunique(dropna=False) # Returns 4 as NaN is counted
When I then use groupby, how can I pass the dropna=False parameter along with the function for the 'colors' column?
df.groupby('numbers').agg({'colors': 'nunique', 'weight': 'sum'})
Upvotes: 2
Views: 925
Reputation: 59579
Though pandas has nice syntax for aggregating with dicts and NamedAggs, these can come at a huge efficiency cost. The reason is that instead of using the built-in groupby methods, which are optimized and/or implemented in Cython, any .agg(lambda x: ...) or .apply(lambda x: ...) is going to take a much slower path.
What this means is that you should stick with the built-ins you can reference directly or by alias; only as a last resort should you reach for a lambda.
In this particular case, use
df.groupby('numbers')[['colors']].agg('nunique', dropna=False)
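If you only need that one column, the same fast path can also be written as a direct method call on the grouped column (a minimal equivalent spelling using the df from the question; note it returns a Series rather than a one-column DataFrame):
# Equivalent direct call; SeriesGroupBy.nunique accepts dropna and stays on the optimized path
df.groupby('numbers')['colors'].nunique(dropna=False)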
Avoid
df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False)})
This example shows that, while equivalent in output and a seemingly minor change, the two spellings have enormous performance consequences, especially as the number of groups becomes large.
import perfplot
import pandas as pd
import numpy as np

def built_in(df):
    # String alias dispatches to the optimized groupby nunique
    return df.groupby('numbers')[['colors']].agg('nunique', dropna=False)

def apply(df):
    # The lambda forces a slow, per-group Python call
    return df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False)})

perfplot.show(
    setup=lambda n: pd.DataFrame({'numbers': np.random.randint(0, n//10+1, n),
                                  'colors': np.random.choice([np.nan] + [*range(100)], n)}),
    kernels=[
        lambda df: built_in(df),
        lambda df: apply(df)],
    labels=['Built-In', 'Apply'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,
    xlabel='~N Groups'
)
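As a quick sanity check (a sketch using the question's small df), the two kernels produce the same result:
# Both return a single 'colors' column indexed by 'numbers';
# check_dtype=False guards against any minor dtype difference between the two paths
pd.testing.assert_frame_equal(built_in(df), apply(df), check_dtype=False)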
The .groupby() part of a groupby doesn't really do that much; it simply ensures the mapping is correct. So, though it is unintuitive, it is still much faster to aggregate each column with its built-in separately and concatenate the results at the end than it is to agg everything at once with a seemingly simpler dict and a lambda.
Here is an example that also sums the weight column; splitting is still a lot faster, despite needing to join the results manually:
def built_in(df):
    # Two cheap groupby calls on the optimized paths, joined at the end
    return pd.concat([df.groupby('numbers')[['colors']].agg('nunique', dropna=False),
                      df.groupby('numbers')[['weight']].sum()], axis=1)

def apply(df):
    # One agg call, but the lambda on 'colors' forces a slow per-group Python loop
    return df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False),
                                      'weight': 'sum'})

perfplot.show(
    setup=lambda n: pd.DataFrame({'numbers': np.random.randint(0, n//10+1, n),
                                  'colors': np.random.choice([np.nan] + [*range(100)], n),
                                  'weight': np.random.normal(0, 1, n)}),
    kernels=[
        lambda df: built_in(df),
        lambda df: apply(df)],
    labels=['Built-In', 'Apply'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,
    xlabel='~N Groups'
)
Upvotes: 6
Reputation: 13831
You can use this link https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html and navigate to the "Aggregation" section, where you can find the different methods that are available on grouped data.
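If you just want a rough list of the names .agg() can dispatch to by string, one quick (unofficial) way is to inspect the groupby object itself:
gb = df.groupby('numbers')
# Public attributes on the groupby object; most string aliases like 'nunique' or 'sum' map to these
print([name for name in dir(gb) if not name.startswith('_')])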
In your case, you can pass dropna=False if you use a lambda function:
df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False), 'weight': 'sum'})
Out[324]: 
         colors  weight
numbers                
1             2      20
2             2      15
3             2      40
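If you prefer explicit output column names, the same aggregation can also be written with named aggregation (pandas 0.25+); a minimal sketch on the question's df:
df.groupby('numbers').agg(
    colors=('colors', lambda x: x.nunique(dropna=False)),
    weight=('weight', 'sum'))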
Upvotes: 2