mcsoini

Reputation: 6642

What exactly goes wrong when using sort_values with inplace=True inside groupby?

Prompted by a recent question, I started wondering what exactly goes wrong when a group is sorted with inplace=True inside a function applied to a groupby.

Example and problem

import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'],
                   'B': [3, 2, 1]})

def func(x):
    x.sort_values('B', inplace=True)
    return x.B.max()

dfg = df.groupby('A')
dfg.apply(func)

This gives

A
a    3
b    3

while one would expect

A
a    3
b    1

Printing x inside the function shows that the function func is applied to the group 'a' during each call (the group 'b' is "replaced" entirely):

def func(x):
    print(x)
    x.sort_values('B', inplace=True)
    return x.B.max()

# Output (including the usual pandas apply zero-call)
   A  B
0  a  3
1  a  2
   A  B
0  a  3
1  a  2
   A  B
1  a  2
0  a  3

Solution to the problem

This issue can be fixed by performing the sort inside func like x = x.sort_values('B'). In this case, everything works as expected.
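A minimal sketch of that fix, for reference. Selecting the column with [['B']] before apply is my own addition (an assumption, not part of the original example); it only keeps the grouping column out of the frames passed to func, so the snippet behaves the same across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'],
                   'B': [3, 2, 1]})

def func(x):
    # sort_values without inplace returns a new, sorted DataFrame;
    # the object handed in by apply is left untouched
    x = x.sort_values('B')
    return x['B'].max()

# [['B']] is an assumption added here: it restricts each group to column B
result = df.groupby('A')[['B']].apply(func)
print(result)
```

With the rebinding, each call works on its own sorted object, so the per-group maxima come out as 3 for 'a' and 1 for 'b'.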

Question

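Now to my conceptual problem: as a first thought I would expect that inplace=True modifies the DataFrame/DataFrameGroupBy itself, while the assignment x = x.sort_values('B') creates a copy.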

However, inspecting both the DataFrame df and the DataFrameGroupBy instance dfg shows that they are unchanged after the apply, which suggests that the problem is not a modification of the original instances. So what is going on here?
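One way to make that inspection concrete is to compare df against a snapshot taken before the apply. This is only a sketch: the name snapshot and the [['B']] column selection are my own additions, and the values returned by apply are deliberately ignored because they depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'],
                   'B': [3, 2, 1]})
snapshot = df.copy(deep=True)  # independent copy taken before the apply

def func(x):
    x.sort_values('B', inplace=True)  # the in-place sort from the question
    return x['B'].max()

# The returned values are not checked here; only df itself is of interest.
df.groupby('A')[['B']].apply(func)

unchanged = df.equals(snapshot)
print(unchanged)
```

If unchanged is True, the in-place sort did not touch the original DataFrame's values or row order, which matches the observation above.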

Upvotes: 3

Views: 264

Answers (1)

Mohit Rajpoot

Reputation: 111

When I did

def func(x):
    x = x.copy()
    x.sort_values('B', inplace=True)
    return x.B.max()

It returns

A
a    3
b    1

so it supports your first thought, i.e.

  1. that inplace modifies the DataFrame/DataFrameGroupBy itself, while
    the assignment x = x.sort_values('B') creates a copy

I also iterated over the dfg groupby object directly.

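def func(x):
    # sort_values(..., inplace=True) returns None, so do not rebind x to it
    x.sort_values('B', inplace=True)
    return x.B.max()

dfg = df.groupby('A')
# iterating over a groupby yields (group name, group DataFrame) pairs
for name, group in dfg:
    print(func(group))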

It returns

3
1

Hence, as far as I understand, the issue has to do with how DataFrame.groupby().apply() iterates over its groups. It appears to hand the same memory block to every group, and once you overwrite that block by using inplace=True, the change persists for the subsequent calls. That is why your dfg and df variables still hold the original values while the output is nonetheless wrong.

Upvotes: 1
