Itamar Mushkin

Reputation: 2905

Apply expanding function on dataframe

I have a function that I wish to apply to subsets of a pandas DataFrame, so that for each row the function is calculated on all rows of the same group up to and including the current row, i.e. using a groupby and then expanding.

For example, this dataframe:

import pandas as pd

df = pd.DataFrame.from_dict(
    {
        'group': ['A','A','A','B','B','B'],
        'time': [1,2,3,1,2,3],
        'x1': [10,40,30,100,200,300],
        'x2': [1,2,1,2,0,3]  # matches the printed table and expected result
    }).sort_values('time')

i.e.

    group   time    x1      x2
0   A       1       10      1
3   B       1       100     2
1   A       2       40      2
4   B       2       200     0
2   A       3       30      1
5   B       3       300     3

and this function, for example:

def foo(_df):
    return _df['x1'].max() * _df['x2'].iloc[-1]

[Edited for clarity following feedback from jezrael: my actual function is more complicated and cannot easily be broken down into components for this task. This simple function is just for an MCVE.]

I want to do something like: df['foo_result'] = df.groupby('group').expanding().apply(foo, raw=False)

To obtain this result:

    group   time    x1  x2  foo_result
0   A       1       10  1   10
3   B       1       100 2   200
1   A       2       40  2   80
4   B       2       200 0   0
2   A       3       30  1   40
5   B       3       300 3   900

The problem is that running df.groupby('group').expanding().apply(foo, raw=False) raises KeyError: 'x1'.

Is there a correct way to run this, or is it not possible to do so in pandas without breaking down my function into components?

Upvotes: 1

Views: 5266

Answers (2)

Itamar Mushkin

Reputation: 2905

Applying a dataframe function on an expanding window is apparently not possible (at least not in pandas 0.23.0 and, as of a later edit, not in 1.3.0 either), as one can see by putting a print statement into the function.

Running df.groupby('group').expanding().apply(lambda x: bool(print(x)) , raw=False) on the given DataFrame (where the bool around the print is just to get a valid return value) prints:

0    1.0
dtype: float64
0    1.0
1    2.0
dtype: float64
0    1.0
1    2.0
2    3.0
dtype: float64
0    10.0
dtype: float64
0    10.0
1    40.0
dtype: float64
0    10.0
1    40.0
2    30.0
dtype: float64

(and so on - and also returns a dataframe with '0.0' in each cell, of course).

This shows that the expanding window works on a column-by-column basis (we see that first the expanding time series is printed, then x1, and so on), and does not really work on a dataframe - so a dataframe function can't be applied to it.

So, to get the desired functionality, one would have to put the expanding inside the dataframe function, as in the accepted answer.
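For a function that really cannot be broken down, one possible workaround is to build the expanding windows by hand inside each group. The sketch below is only an illustration: expanding_apply is a hypothetical helper name (not a pandas API), it calls the function once per row on a growing slice, so it is quadratic in the group size, and it assumes the index is unique so the results align back to df.

```python
import pandas as pd

def foo(_df):
    return _df['x1'].max() * _df['x2'].iloc[-1]

def expanding_apply(group_df, func):
    # Hypothetical helper: call func on rows 0..i of the group for every i,
    # i.e. on each expanding window as a full DataFrame.
    values = [func(group_df.iloc[: i + 1]) for i in range(len(group_df))]
    return pd.Series(values, index=group_df.index)

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'time': [1, 2, 3, 1, 2, 3],
    'x1': [10, 40, 30, 100, 200, 300],
    'x2': [1, 2, 1, 2, 0, 3],  # values as in the printed table
}).sort_values('time')

# Iterate over the groups explicitly; rows keep their time order within a group.
parts = [expanding_apply(g, foo) for _, g in df.groupby('group')]

# Assignment aligns on the original index, so the row order of df is unchanged.
df['foo_result'] = pd.concat(parts)
print(df)
```

This keeps foo untouched, at the cost of speed on large groups.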

Upvotes: 2

jezrael

Reputation: 862741

A possible solution is to make the expanding part of the function, and use GroupBy.apply:

def foo1(_df):
    return _df['x1'].expanding().max() * _df['x2'].expanding().apply(lambda x: x[-1], raw=True)

df['foo_result'] = df.groupby('group').apply(foo1).reset_index(level=0, drop=True)
print(df)
  group  time   x1  x2  foo_result
0     A     1   10   1        10.0
3     B     1  100   2       200.0
1     A     2   40   2        80.0
4     B     2  200   0         0.0
2     A     3   30   1        40.0
5     B     3  300   3       900.0

This is not a direct solution to the problem of applying a dataframe function to an expanding dataframe, but it achieves the same functionality.
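As a side note for this particular MCVE only: the last value of an expanding window over x2 is simply the current x2, and the expanding maximum of x1 within a group is a grouped cumulative maximum. Under that assumption (which will not hold for a function that cannot be decomposed), the same numbers can be sketched without GroupBy.apply:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'time': [1, 2, 3, 1, 2, 3],
    'x1': [10, 40, 30, 100, 200, 300],
    'x2': [1, 2, 1, 2, 0, 3],  # values as in the printed table
}).sort_values('time')

# Running maximum of x1 within each group, aligned to df's index,
# multiplied by the current row's x2.
df['foo_result'] = df.groupby('group')['x1'].cummax() * df['x2']
print(df)
```

Here the result stays an integer column, since no expanding().apply step converts the values to float.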

Upvotes: 2
