Reputation: 2905
I have a function that I wish to apply to subsets of a pandas DataFrame, so that the function is calculated on all rows (up to and including the current row) from the same group, i.e. using a groupby and then expanding.
For example, this dataframe:
import pandas as pd

df = pd.DataFrame.from_dict(
    {
        'group': ['A','A','A','B','B','B'],
        'time': [1,2,3,1,2,3],
        'x1': [10,40,30,100,200,300],
        'x2': [1,2,1,2,0,3]
    }).sort_values('time')
i.e.
  group  time   x1  x2
0     A     1   10   1
3     B     1  100   2
1     A     2   40   2
4     B     2  200   0
2     A     3   30   1
5     B     3  300   3
and this function, for example:
def foo(_df):
    return _df['x1'].max() * _df['x2'].iloc[-1]
[Edited for clarity following feedback from jezrael: my actual function is more complicated and cannot easily be broken down into components for this task. This simple function is just for an MCVE.]
I want to do something like:
df['foo_result'] = df.groupby('group').expanding().apply(foo, raw=False)
To obtain this result:
  group  time   x1  x2  foo_result
0     A     1   10   1          10
3     B     1  100   2         200
1     A     2   40   2          80
4     B     2  200   0           0
2     A     3   30   1          40
5     B     3  300   3         900
The problem is that running df.groupby('group').expanding().apply(foo, raw=False) results in KeyError: 'x1'.
Is there a correct way to run this, or is it not possible to do so in pandas without breaking down my function into components?
Upvotes: 1
Views: 5266
Reputation: 2905
Applying a DataFrame function on an expanding window is apparently not possible (at least not in pandas 0.23.0; edit: also not in 1.3.0), as one can see by putting a print statement into the function.
Running df.groupby('group').expanding().apply(lambda x: bool(print(x)), raw=False) on the given DataFrame (where the bool around the print is just to get a valid return value) prints:
0 1.0
dtype: float64
0 1.0
1 2.0
dtype: float64
0 1.0
1 2.0
2 3.0
dtype: float64
0 10.0
dtype: float64
0 10.0
1 40.0
dtype: float64
0 10.0
1 40.0
2 30.0
dtype: float64
(and so on - and also returns a dataframe with '0.0' in each cell, of course).
This shows that the expanding window works on a column-by-column basis (we see that first the expanding time series is printed, then x1, and so on) and never passes a DataFrame to the function, so a function that expects a DataFrame can't be applied to it.
So, to get the desired functionality, one has to put the expanding inside the DataFrame function, as in the accepted answer.
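As a quick check of the column-by-column behaviour described above, the sketch below records the type of every object handed to the applied function instead of printing it. The names probe and seen_types are made up for this illustration, and the x2 values are taken from the table printed in the question. Selecting ['x1', 'x2'] explicitly is a choice made here so that the grouping column is not part of the windows, whatever the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'time': [1, 2, 3, 1, 2, 3],
    'x1': [10, 40, 30, 100, 200, 300],
    'x2': [1, 2, 1, 2, 0, 3],
}).sort_values('time')

seen_types = set()  # hypothetical helper state, only for this demonstration

def probe(window):
    # Record what kind of object the expanding window passes in.
    seen_types.add(type(window).__name__)
    return 0.0  # apply() needs a numeric return value

# Restrict to numeric columns so the grouping column is not windowed.
df.groupby('group')[['x1', 'x2']].expanding().apply(probe, raw=False)
print(seen_types)
```

If only Series objects show up in seen_types, a lookup such as _df['x1'] inside the function is a label lookup on a single column's values, which is consistent with the KeyError: 'x1' from the question.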
Upvotes: 2
Reputation: 862741
A possible solution is to make the expanding part of the function, and use GroupBy.apply:
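def foo1(_df):
    # expanding max of x1, times the last x2 value seen in each expanding window
    return _df['x1'].expanding().max() * _df['x2'].expanding().apply(lambda x: x[-1], raw=True)

# drop the group level added by apply so the result aligns with df's index
df['foo_result'] = df.groupby('group').apply(foo1).reset_index(level=0, drop=True)
print(df)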
  group  time   x1  x2  foo_result
0     A     1   10   1        10.0
3     B     1  100   2       200.0
1     A     2   40   2        80.0
4     B     2  200   0         0.0
2     A     3   30   1        40.0
5     B     3  300   3       900.0
This is not a direct solution to the problem of applying a DataFrame function to an expanding DataFrame, but it achieves the same functionality.
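If the real function cannot be rewritten in terms of per-column expanding operations, one fallback is to call it unchanged on successive row prefixes of each group. The following is only a minimal sketch of that idea: expanding_apply is a hypothetical helper name, not a pandas API, and the x2 values are taken from the table printed in the question. It calls the function once per row on a growing slice, so the cost is quadratic in the group size.

```python
import pandas as pd

def foo(_df):
    # The question's example function, left unchanged.
    return _df['x1'].max() * _df['x2'].iloc[-1]

def expanding_apply(group, func):
    # Hypothetical helper: call func on the first 1, 2, ..., n rows of group
    # and keep the group's original index so the result can be aligned later.
    values = [func(group.iloc[:i]) for i in range(1, len(group) + 1)]
    return pd.Series(values, index=group.index)

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'time': [1, 2, 3, 1, 2, 3],
    'x1': [10, 40, 30, 100, 200, 300],
    'x2': [1, 2, 1, 2, 0, 3],
}).sort_values('time')

# Build one result Series per group, then let index alignment place the values.
pieces = [expanding_apply(g, foo) for _, g in df.groupby('group')]
df['foo_result'] = pd.concat(pieces)
print(df)
```

Iterating over the groups explicitly, rather than going through GroupBy.apply, is a choice made here to avoid depending on how different pandas versions attach group keys to the result.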
Upvotes: 2