Reputation: 6642
Based on a recent question, I came to wonder what exactly goes wrong when sorting a group using inplace=True inside a function applied to a groupby.
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'],
                   'B': [3, 2, 1]})

def func(x):
    # Sort the group in place, then return the largest B value
    x.sort_values('B', inplace=True)
    return x.B.max()

dfg = df.groupby('A')
dfg.apply(func)
This gives
A
a 3
b 3
while one would expect
A
a 3
b 1
Printing x inside the function shows that func is applied to the group 'a' during each call (the group 'b' is "replaced" entirely):
def func(x):
    print(x)
    x.sort_values('B', inplace=True)
    return x.B.max()
# Output (including the usual pandas apply zero-call)
   A  B
0  a  3
1  a  2
   A  B
0  a  3
1  a  2
   A  B
1  a  2
0  a  3
This issue can be fixed by performing the sort inside func as x = x.sort_values('B'). In this case, everything works as expected.
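A minimal sketch of the non-mutating variant, for illustration only. The name func_no_inplace is hypothetical, and selecting [['B']] before apply is an assumption made here so that the function only ever sees column B, whichever pandas version is in use.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'],
                   'B': [3, 2, 1]})

def func_no_inplace(x):
    # Rebind the local name to a new, sorted DataFrame;
    # the object handed over by apply is left untouched.
    x = x.sort_values('B')
    return x['B'].max()

# Select column B explicitly so the grouping column is not passed to the function
result = df.groupby('A')[['B']].apply(func_no_inplace)
print(result)

# If only the maximum is needed, no sorting is required at all
simple = df.groupby('A')['B'].max()
print(simple)
```

Both result and simple hold 3 for group 'a' and 1 for group 'b'.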
Now to my conceptual problem: my first thought was that inplace=True modifies the DataFrame/DataFrameGroupBy itself, while the assignment x = x.sort_values('B') creates a copy. However, inspection of both the DataFrame df and the DataFrameGroupBy instance dfg reveals that they are unchanged after the apply, which suggests that the problem is not the modification of the original instances. So what is going on here?
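The difference between the two forms can be seen on a plain DataFrame, outside of any groupby. The following is a small sketch with made-up variable names (frame, alias, frame2, alias2): the in-place sort changes the object that every name points to, while the assignment only rebinds one name to a new object.

```python
import pandas as pd

# In-place sort: the object itself changes, and the call returns None
frame = pd.DataFrame({'B': [3, 2, 1]})
alias = frame  # second name for the same object
returned = frame.sort_values('B', inplace=True)
print(returned)             # None
print(alias['B'].tolist())  # [1, 2, 3]

# Assignment: a new sorted object is created and only the name frame2 is rebound
frame2 = pd.DataFrame({'B': [3, 2, 1]})
alias2 = frame2
frame2 = frame2.sort_values('B')
print(alias2['B'].tolist())  # [3, 2, 1]
print(frame2['B'].tolist())  # [1, 2, 3]
```

Inside a function passed to apply, the first form therefore alters whatever object pandas handed in, while the second leaves that object alone.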
Upvotes: 3
Views: 264
Reputation: 111
When I did
def func(x):
    x = x.copy()  # work on a private copy of the group
    x.sort_values('B', inplace=True)
    return x.B.max()
It returns
A
a 3
b 1
so it verifies your first thought, i.e. that x = x.sort_values('B') creates a copy. I also iterated over the dfg groupby object directly.
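def func(x):
    # inplace=True returns None, so the result must not be assigned back to x
    x.sort_values('B', inplace=True)
    return x.B.max()

dfg = df.groupby('A')
for x in dfg:
    # each item is a (group key, group DataFrame) tuple
    print(func(x[1]))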
It returns
3
1
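The manual loop can also be written as a self-contained sketch that collects the per-group results. The name max_b_sorted is hypothetical, and copying each group before sorting is a precaution added here, not something the loop strictly requires.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'],
                   'B': [3, 2, 1]})

def max_b_sorted(group):
    # Copy first so the in-place sort only affects this private object
    group = group.copy()
    group.sort_values('B', inplace=True)
    return int(group['B'].max())

# Iterating over a groupby yields (key, group DataFrame) pairs
results = {key: max_b_sorted(group) for key, group in df.groupby('A')}
print(results)  # {'a': 3, 'b': 1}
```

Each group is handled as its own DataFrame here, so the result for 'b' is 1, as expected.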
Hence, from my understanding, this issue has to do with how DataFrame.groupby().apply() iterates over its groups.
It appears to reuse the same memory block for all of its groups, and once you overwrite that block by using inplace=True, the change persists for the following calls.
That is why your dfg and df variables still hold the original values while you still get the wrong output.
Upvotes: 1