Reputation: 153
Can pandas groupby use groupby.apply(func) and, inside func, use another instance of .apply() without duplicating and overwriting data? In a way, the use of .apply() is nested.
Python 3.7.3
pandas==0.25.1
import pandas as pd

def dummy_func_nested(row):
    row['new_col_2'] = row['value'] * -1
    return row

def dummy_func(df_group):
    df_group['new_col_1'] = None
    # apply dummy_func_nested
    df_group = df_group.apply(dummy_func_nested, axis=1)
    return df_group

def pandas_groupby():
    # initialize data
    df = pd.DataFrame([
        {'country': 'US', 'value': 100.00, 'id': 'a'},
        {'country': 'US', 'value': 95.00, 'id': 'b'},
        {'country': 'CA', 'value': 56.00, 'id': 'y'},
        {'country': 'CA', 'value': 40.00, 'id': 'z'},
    ])
    # group by country and apply first dummy_func
    new_df = df.groupby('country').apply(dummy_func)
    # new_df and df should have the same list of countries
    assert new_df['country'].tolist() == df['country'].tolist()
    print(df)

if __name__ == '__main__':
    pandas_groupby()
The above code should return
  country  value id new_col_1  new_col_2
0      US  100.0  a      None     -100.0
1      US   95.0  b      None      -95.0
2      CA   56.0  y      None      -56.0
3      CA   40.0  z      None      -40.0
However, the code returns
  country  value id new_col_1  new_col_2
0      US  100.0  a      None     -100.0
1      US   95.0  a      None      -95.0
2      US   56.0  a      None      -56.0
3      US   40.0  a      None      -40.0
This behavior only appears to happen when both groups have an equal number of rows. If one group has more rows than the other, the output is as expected.
Upvotes: 4
Views: 703
Reputation: 153
When using groupby, we should avoid calling apply() inside a function that is itself passed to groupby.apply().
The corrected code that produces the desired result is below.
Disclaimer: the code could be written more efficiently. The purpose is to demonstrate that we should avoid calling apply() inside groupby.apply(). Doing so has adverse effects when the groups contain an equal number of rows. If the groups differ in size, everything works as expected.
Shoutout to user: u10-forward
import pandas as pd

def dummy_func_nested(df):
    df['new_col_2'] = df['value'] * -1
    return df

def dummy_func(df_group):
    df_group['new_col_1'] = None
    # call dummy_func_nested directly on the whole group
    df_group = dummy_func_nested(df_group)
    return df_group

def pandas_groupby():
    # initialize data
    df = pd.DataFrame([
        {'country': 'US', 'value': 100.00, 'id': 'a'},
        {'country': 'US', 'value': 95.00, 'id': 'b'},
        {'country': 'CA', 'value': 56.00, 'id': 'y'},
        {'country': 'CA', 'value': 40.00, 'id': 'z'},
    ])
    # group by country and apply first dummy_func
    new_df = df.groupby('country').apply(dummy_func)
    # new_df and df should have the same list of countries
    assert new_df['country'].tolist() == df['country'].tolist()
    print(df)

if __name__ == '__main__':
    pandas_groupby()
That said, I still think it is a bug that apply() cannot be called inside groupby.apply().
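For cases where the work really has to be done group by group, here is a minimal sketch of one alternative that sidesteps both the nested apply() and the in-place mutation. It is not the approach from the code above: it iterates over the groups and uses assign(), which returns a new frame. The helper name add_columns is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {'country': 'US', 'value': 100.00, 'id': 'a'},
    {'country': 'US', 'value': 95.00, 'id': 'b'},
    {'country': 'CA', 'value': 56.00, 'id': 'y'},
    {'country': 'CA', 'value': 40.00, 'id': 'z'},
])

def add_columns(df_group):
    # hypothetical helper: assign() builds a new frame, so the group is not mutated
    return df_group.assign(new_col_1=None, new_col_2=df_group['value'] * -1)

# iterate over the groups instead of calling groupby.apply()
pieces = [add_columns(group) for _, group in df.groupby('country')]

# groupby sorts the keys (CA before US), so restore the original row order
new_df = pd.concat(pieces).sort_index()
print(new_df)
```

Because nothing is modified in place, df keeps its original three columns and new_df carries the two new ones.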
Upvotes: 0
Reputation: 71580
A quote from the documentation:
In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
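A small sketch to see this for yourself: record which rows the function is called on. The names calls and record_and_negate are made up for illustration, and whether the first row shows up twice depends on the pandas version you run it with, so no particular output is claimed here.

```python
import pandas as pd

calls = []  # index label of every row the function is called with

def record_and_negate(row):
    # side effect: remember that this row was visited
    calls.append(row.name)
    return row['value'] * -1

df = pd.DataFrame({'value': [100.0, 95.0, 56.0]})
result = df.apply(record_and_negate, axis=1)

# If the first label is listed twice, the function ran twice on that row.
print(calls)
print(result.tolist())
```

The returned values are the same either way. Only the side effect (the contents of calls) can differ, which is why functions that mutate their input are risky here.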
Try changing this part of your code:
def dummy_func(df_group):
    df_group['new_col_1'] = None
    # apply dummy_func_nested
    df_group = df_group.apply(dummy_func_nested, axis=1)
    return df_group
To:
def dummy_func(df_group):
    df_group['new_col_1'] = None
    # call dummy_func_nested directly on the whole group
    df_group = dummy_func_nested(df_group)
    return df_group
You don't need the apply.
Of course, the more efficient way would be:
df['new_col_1'] = None
df['new_col_2'] = -df['value']
print(df)
Or:
print(df.assign(new_col_1=None, new_col_2=-df['value']))
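For completeness, a self-contained version of the assign() variant using the sample data from the question, as a quick check that the country and id columns come through untouched:

```python
import pandas as pd

df = pd.DataFrame([
    {'country': 'US', 'value': 100.00, 'id': 'a'},
    {'country': 'US', 'value': 95.00, 'id': 'b'},
    {'country': 'CA', 'value': 56.00, 'id': 'y'},
    {'country': 'CA', 'value': 40.00, 'id': 'z'},
])

# assign() returns a new DataFrame and leaves df unchanged
new_df = df.assign(new_col_1=None, new_col_2=-df['value'])

# same check as in the question: the countries must not be overwritten
assert new_df['country'].tolist() == df['country'].tolist()
print(new_df)
```

No groupby is involved, since neither new column depends on the group.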
Upvotes: 2