bkeesey

Reputation: 496

How do you change the input parameters of the pandas groupby .agg() function?

I am having issues using the groupby_object.agg() method with functions where I want to change the default parameters. Is there a resource listing the function names .agg() accepts, and how to pass parameters to them?

See an example below:

import pandas as pd
import numpy as np

df = pd.DataFrame({'numbers': [1, 2, 3, 2, 1, 3], 
               'colors': ['red', 'white', 'blue', 'red', 'white', np.nan], 
               'weight': [10, 10, 20, 5, 10, 20]})

df['colors'].nunique() # Returns 3 as NaN is not counted
df['colors'].nunique(dropna=False) # Returns 4 as NaN is counted

When I then group by 'numbers', how can I pass the dropna=False parameter along with the 'nunique' function for the 'colors' column?

df.groupby('numbers').agg({'colors': 'nunique', 'weight': 'sum'})

Upvotes: 2

Views: 925

Answers (2)

ALollz

Reputation: 59579

Though pandas has nice syntax for aggregating with dicts and NamedAggs, these can come at a huge efficiency cost. The reason is that instead of using the built-in groupby methods, which are optimized and/or implemented in Cython, any .agg(lambda x: ...) or .apply(lambda x: ...) is going to take a much slower path.
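(For reference, the NamedAgg syntax mentioned above looks roughly like this; the result-column name distinct_colors is just an illustrative choice, and because it still funnels through a lambda it pays the same cost.)

# named aggregation: result column name on the left, column/aggfunc spec on the right
df.groupby('numbers').agg(
    distinct_colors=pd.NamedAgg(column='colors',
                                aggfunc=lambda x: x.nunique(dropna=False)))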

What this means is that you should stick with the built-ins you can reference directly or by alias. Only as a last resort should you try to use a lambda:

In this particular case use

df.groupby('numbers')[['colors']].agg('nunique', dropna=False)
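On the question's df this should return something like:

         colors
numbers        
1             2
2             2
3             2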

Avoid

df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False)})

This example shows that, while the two are equivalent in output and the change looks minor, the performance consequences are enormous, especially as the number of groups becomes large.

import perfplot
import pandas as pd
import numpy as np

def built_in(df):
    return df.groupby('numbers')[['colors']].agg('nunique', dropna=False)

def apply(df):
    return df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False)})

perfplot.show(
    # n rows split into ~n/10 groups; colors sampled from 100 values plus NaN
    setup=lambda n: pd.DataFrame({'numbers': np.random.randint(0, n//10+1, n),
                                  'colors': np.random.choice([np.nan] + [*range(100)], n)}),
    kernels=[
        lambda df: built_in(df),
        lambda df: apply(df)],
    
    labels=['Built-In', 'Apply'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,  
    xlabel='~N Groups'
)

[perfplot benchmark: Built-In vs Apply runtime as the number of groups grows]


But what if you want to do multiple aggregations and use different columns?

The .groupby() part of a groupby doesn't do much work on its own; it simply establishes the mapping from rows to groups. So, though it may seem unintuitive, it is still much faster to aggregate each column separately with the built-ins and concatenate the results at the end than it is to call .agg once with a dict that uses a lambda.

Here is an example that also sums the weight column; splitting the aggregations is still a lot faster, despite the need to join the results manually:

def built_in(df):
    return pd.concat([df.groupby('numbers')[['colors']].agg('nunique', dropna=False),
                      df.groupby('numbers')[['weight']].sum()], axis=1)

def apply(df):
    return df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False), 
                                      'weight': 'sum'})

perfplot.show(
    # n rows split into ~n/10 groups; colors sampled from 100 values plus NaN
    setup=lambda n: pd.DataFrame({'numbers': np.random.randint(0, n//10+1, n),
                                  'colors': np.random.choice([np.nan] + [*range(100)], n),
                                  'weight': np.random.normal(0,1,n)}),
    kernels=[
        lambda df: built_in(df),
        lambda df: apply(df)],
    
    labels=['Built-In', 'Apply'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,  
    xlabel='~N Groups'
)

[perfplot benchmark: separate Built-In aggregations plus concat vs a single .agg with a lambda]

Upvotes: 6

sophocles

Reputation: 13831

You can use this link: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html and navigate to the "Aggregation" section, where you can find the different methods that are available on grouped data.

In your case, you can pass dropna=False if you use a lambda function:

df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False), 'weight': 'sum'})

Out[324]: 
         colors  weight
numbers                
1             2      20
2             2      15
3             2      40

Upvotes: 2
