Oleh Dubno

Reputation: 153

Pandas groupby is duplicating groups when using apply twice

Can pandas groupby use groupby.apply(func) and inside the func use another instance of .apply() without duplicating and overwriting data?

In a way, the use of .apply() is nested.

Environment: Python 3.7.3, pandas==0.25.1

import pandas as pd


def dummy_func_nested(row):
    row['new_col_2'] = row['value'] * -1
    return row


def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = df_group.apply(dummy_func_nested, axis=1)

    return df_group


def pandas_groupby():
    # initialize data
    df = pd.DataFrame([
        {'country': 'US', 'value': 100.00, 'id': 'a'},
        {'country': 'US', 'value': 95.00, 'id': 'b'},
        {'country': 'CA', 'value': 56.00, 'id': 'y'},
        {'country': 'CA', 'value': 40.00, 'id': 'z'},
    ])

    # group by country and apply first dummy_func
    new_df = df.groupby('country').apply(dummy_func)

    # new_df and df should have the same list of countries
    assert new_df['country'].tolist() == df['country'].tolist()
    print(new_df)


if __name__ == '__main__':
    pandas_groupby()

The above code should return:

  country  value id new_col_1  new_col_2
0      US  100.0  a      None     -100.0
1      US   95.0  b      None      -95.0
2      CA   56.0  y      None      -56.0
3      CA   40.0  z      None      -40.0

However, the code returns:

  country  value id new_col_1  new_col_2
0      US  100.0  a      None     -100.0
1      US   95.0  a      None      -95.0
2      US   56.0  a      None      -56.0
3      US   40.0  a      None      -40.0

This behavior appears to happen only when both groups have the same number of rows. If one group has more rows than the other, the output is as expected.

Upvotes: 4

Views: 703

Answers (2)

Oleh Dubno

Reputation: 153

When using groupby, avoid calling apply() inside a function that is itself passed to groupby.apply().

The correct code that produces desired results is below.

Disclaimer: the code could be written more efficiently. Its purpose is to demonstrate that we should avoid calling apply() inside groupby.apply(). Doing so has adverse effects when every group has the same number of rows. When the groups differ in size, everything goes smoothly.

Shoutout to user: u10-forward

import pandas as pd


def dummy_func_nested(df):
    df['new_col_2'] = df['value'] * -1
    return df


def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = dummy_func_nested(df_group)

    return df_group


def pandas_groupby():
    # initialize data
    df = pd.DataFrame([
        {'country': 'US', 'value': 100.00, 'id': 'a'},
        {'country': 'US', 'value': 95.00, 'id': 'b'},
        {'country': 'CA', 'value': 56.00, 'id': 'y'},
        {'country': 'CA', 'value': 40.00, 'id': 'z'},
    ])

    # group by country and apply first dummy_func
    new_df = df.groupby('country').apply(dummy_func)

    # new_df and df should have the same list of countries
    assert new_df['country'].tolist() == df['country'].tolist()
    print(new_df)


if __name__ == '__main__':
    pandas_groupby()

That said, I still think it is a bug that apply() cannot be called safely inside groupby.apply().
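If a per-group calculation is really needed, one way to sidestep the nesting is to compute the new column with groupby(...).transform(), which returns values aligned to the original index. This is only a sketch under that assumption: the lambda and the names negated and result are illustrative and not part of the original code.

```python
import pandas as pd

df = pd.DataFrame([
    {'country': 'US', 'value': 100.00, 'id': 'a'},
    {'country': 'US', 'value': 95.00, 'id': 'b'},
    {'country': 'CA', 'value': 56.00, 'id': 'y'},
    {'country': 'CA', 'value': 40.00, 'id': 'z'},
])

# transform() runs the function once per group and returns a Series
# that keeps the original index, so rows stay in their original order
negated = df.groupby('country')['value'].transform(lambda s: s * -1)

# build the result without mutating df or the group objects
result = df.assign(new_col_1=None, new_col_2=negated)
print(result)
```

Because nothing is mutated inside the grouped function, the result does not depend on how often pandas calls it.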

Upvotes: 0

U13-Forward

Reputation: 71580

A quote from the documentation:

In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
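To illustrate the point about side effects, here is a minimal sketch of a row function that does not mutate its input. It returns a new Series instead of writing into the row, so it gives the same result however many times apply decides to call it. The names negate_value, extra and result are made up for this example.

```python
import pandas as pd


def negate_value(row):
    # return a fresh Series rather than assigning into the row passed in
    return pd.Series({'new_col_2': row['value'] * -1})


df = pd.DataFrame([
    {'country': 'US', 'value': 100.00, 'id': 'a'},
    {'country': 'US', 'value': 95.00, 'id': 'b'},
    {'country': 'CA', 'value': 56.00, 'id': 'y'},
    {'country': 'CA', 'value': 40.00, 'id': 'z'},
])

# apply row-wise; the returned Series are expanded into a one-column DataFrame
extra = df.apply(negate_value, axis=1)

# attach the new column by index, leaving df itself unchanged
result = df.join(extra)
print(result)
```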

Try changing this part of your code:

def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = df_group.apply(dummy_func_nested, axis=1)

    return df_group

To:

def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = dummy_func_nested(df_group)

    return df_group

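You don't need the nested apply: dummy_func_nested only does column arithmetic, so it works on the whole group DataFrame at once.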

Of course, the more efficient way would be:

df['new_col_1'] = None
df['new_col_2'] = -df['value']
print(df)

Or:

print(df.assign(new_col_1=None, new_col_2=-df['value']))
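For completeness, here is the assign version as a self-contained script using the sample data from the question, with the same check the question used. It is a sketch rather than a drop-in replacement, and the name result is my own.

```python
import pandas as pd

df = pd.DataFrame([
    {'country': 'US', 'value': 100.00, 'id': 'a'},
    {'country': 'US', 'value': 95.00, 'id': 'b'},
    {'country': 'CA', 'value': 56.00, 'id': 'y'},
    {'country': 'CA', 'value': 40.00, 'id': 'z'},
])

# assign returns a new DataFrame; df itself is left untouched
result = df.assign(new_col_1=None, new_col_2=-df['value'])

# same sanity check as in the question: the country column is preserved
assert result['country'].tolist() == df['country'].tolist()
print(result)
```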

Upvotes: 2
