Vinicius

Reputation: 1365

Passing function names as strings to Pandas GroupBy aggregate

In Pandas you can specify how to aggregate your data by passing a string alias ('min' in the following example). From the docs, you have:

df.groupby('A').agg('min')

It is obvious what this is doing, but it really annoys me that I can't find a list of these string aliases, or a description of what they do, anywhere in the docs.

Does anyone know of a reference for these aliases?

Upvotes: 5

Views: 1926

Answers (2)

Michael Delgado

Reputation: 15442

String method names can refer to any method of the object being operated on. Additionally, if the object has an __array__ attribute (as far as I can tell, this is the case when you call agg or transform directly on a DataFrame or Series, not through groupby, resample, rolling, etc.), the string can refer to anything in numpy's module-level namespace (e.g. anything in np.__all__). That's not to say that everything that can be referenced will work, but you can reference anything in either of these namespaces.

Examples

Here's an example dataframe:

In [9]: df = pd.DataFrame({'abc': list('aaaabbcccc'), 'data': np.random.random(size=10)})

In [10]: df
Out[10]:
  abc      data
0   a  0.800357
1   a  0.619654
2   a  0.448895
3   a  0.610645
4   b  0.985249
5   b  0.179411
6   c  0.173734
7   c  0.420767
8   c  0.789766
9   c  0.525486

DataFrame & Series methods with .agg and .transform

This can be aggregated or transformed using any DataFrame method (as long as the shape rules that apply to agg and transform are followed).

Of course, there are the aggregation methods we're all familiar with:

In [93]: df.agg("sum")
Out[93]:
abc     aaaabbcccc
data      5.553964
dtype: object

But you could really give anything in the DataFrame/Series API a whirl:

In [95]: df.transform("shift")
Out[95]:
   abc      data
0  NaN       NaN
1    a  0.800357
2    a  0.619654
3    a  0.448895
4    a  0.610645
5    b  0.985249
6    b  0.179411
7    c  0.173734
8    c  0.420767
9    c  0.789766

In [102]: df.agg("dtypes")
Out[102]:
abc      object
data    float64
dtype: object

Numpy methods with .agg and .transform

Additionally, when working directly with pandas objects (a DataFrame or Series), we can use numpy's module-level functions as well. Many of these don't work the way you might expect, so user beware:

In [101]: df.data.transform("expm1")
Out[101]:
0    1.226334
1    0.858285
2    0.566580
3    0.841620
4    1.678479
5    0.196512
6    0.189739
7    0.523129
8    1.202882
9    0.691281
Name: data, dtype: float64

In [103]: df.agg("rot90")
Out[103]:
array([[0.8003565068959021, 0.619653790821421, 0.44889504260755986,
        0.6106454343417287, 0.9852492020323964, 0.17941064387786554,
        0.17373389351532997, 0.42076690363942437, 0.7897663627044728,
        0.5254860156343195],
       ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c']], dtype=object)

In [107]: df.agg("meshgrid")
Out[107]:
[array(['a', 0.8003565068959021, 'a', 0.619653790821421, 'a',
        0.44889504260755986, 'a', 0.6106454343417287, 'b',
        0.9852492020323964, 'b', 0.17941064387786554, 'c',
        0.17373389351532997, 'c', 0.42076690363942437, 'c',
        0.7897663627044728, 'c', 0.5254860156343195], dtype=object)]

In [109]: df.agg("diag")
Out[109]: array(['a', 0.619653790821421], dtype=object)

Methods available to GroupBy, Window, and Resample operations

These numpy functions aren't available to GroupBy, Rolling, Expanding, Resample, etc. objects. But you can still call anything in the pandas API that is available on these objects:

In [117]: df.groupby('abc').agg("dtypes")
Out[117]:
        data
abc
a    float64
b    float64
c    float64

In [129]: df.groupby("abc").agg("ohlc")
Out[129]:
         data
         open      high       low     close
abc
a    0.800357  0.800357  0.448895  0.610645
b    0.985249  0.985249  0.179411  0.179411
c    0.173734  0.789766  0.173734  0.525486

In [137]: df.rolling(3).data.agg("quantile", 0.9)
Out[137]:
0         NaN
1         NaN
2    0.764216
3    0.617852
4    0.910328
5    0.910328
6    0.824081
7    0.372496
8    0.715966
9    0.736910
Name: data, dtype: float64

Note that the relevant part of the pandas API here is that of the GroupBy, Window, or Resampler object itself, not that of the DataFrame or Series. So check the API reference for these objects to see the full set of names you can pass.
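Since there is no single documented list of aliases, one rough way to see the candidates is to enumerate the public callables on the object's class. This is only a sketch: the small dataframe and the method_names variable are made up for illustration, and not every name it prints is a sensible aggregation.

```python
import pandas as pd

# Illustrative data, not the dataframe from the examples above
df = pd.DataFrame({"abc": list("aabbc"), "data": [1.0, 2.0, 3.0, 4.0, 5.0]})
gb = df.groupby("abc")

# Inspect the class rather than the instance, so that only names defined
# by the GroupBy API itself are listed.
cls = type(gb)
method_names = sorted(
    name
    for name in dir(cls)
    if not name.startswith("_") and callable(getattr(cls, name, None))
)
print(method_names)
```

The same approach should work for the objects returned by df.rolling(...) or df.resample(...), since string names are looked up on whichever object agg is called on.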

Implementation

Buried deep in the pandas internals, you can trace the handling of string aggregation operations to a couple of variations on this function, which at the time of writing lives at pandas.core.apply._try_aggregate_string_function:


    def _try_aggregate_string_function(self, obj, arg: str, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function (or attribute) on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, str)

        f = getattr(obj, arg, None)
        if f is not None:
            if callable(f):
                return f(*args, **kwargs)

            # people may try to aggregate on a non-callable attribute
            # but don't let them think they can pass args to it
            assert len(args) == 0
            assert len([kwarg for kwarg in kwargs if kwarg not in ["axis"]]) == 0
            return f

        f = getattr(np, arg, None)
        if f is not None and hasattr(obj, "__array__"):
            # in particular exclude Window
            return f(obj, *args, **kwargs)

        raise AttributeError(
            f"'{arg}' is not a valid function for '{type(obj).__name__}' object"
        )

Similarly, in many places in the test suite and internals, the logic getattr(obj, f) is used, where obj is the data structure and f is the string function name.
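As a quick check of that getattr logic, the sketch below compares passing the string to agg with looking the method up by name and calling it yourself. The data and the variable names via_string and via_getattr are illustrative only.

```python
import pandas as pd

# Illustrative data, not the dataframe from the examples above
df = pd.DataFrame({"abc": list("aabbc"), "data": [0.5, 0.25, 1.0, 2.0, 4.0]})
gb = df.groupby("abc")

# Passing the name as a string ...
via_string = gb.agg("min")

# ... should give the same result as resolving the method by name
# on the GroupBy object and calling it directly.
via_getattr = getattr(gb, "min")()

print(via_string.equals(via_getattr))
```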

Upvotes: 6

ORSpecialist

Reputation: 401

https://cmdlinetips.com/2019/10/pandas-groupby-13-functions-to-aggregate/

The link above lists 13 functions that can be used with agg. However, you can also use lambda functions. For example:

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2],
    "B": [1, 2, 3, 4],
    "C": [0.362838, 0.227877, 1.267767, -0.562860]})
# sum each column within each group of A
df.groupby('A').agg(lambda x: sum(x))
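For a simple case like this one, the string alias and the lambda should give the same numbers. The comparison below is a sketch under that assumption; by_alias and by_lambda are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2],
                   "B": [1, 2, 3, 4],
                   "C": [0.362838, 0.227877, 1.267767, -0.562860]})

# Aggregate with the string alias
by_alias = df.groupby("A").agg("sum")

# Aggregate with a lambda that calls the same Series method explicitly
by_lambda = df.groupby("A").agg(lambda x: x.sum())

# Compare values with a tolerance rather than relying on identical dtypes
print(np.allclose(by_alias.to_numpy(), by_lambda.to_numpy()))
```

The string form is usually preferable when it exists, since pandas can dispatch it to its own implementation instead of calling a Python function once per group.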

Upvotes: 0
