Roman Kazmin

Reputation: 981

How to group dataframe by column and receive new column for every group

I have the following dataframe:

import numpy as np
import pandas as pd
df = pd.DataFrame({'timestamp' : [10,10,10,20,20,20], 'idx': [1,2,3,1,2,3], 'v1' : [1,2,4,5,1,9], 'v2' : [1,2,8,5,1,2]})

    timestamp   idx     v1  v2
0   10           1      1   1
1   10           2      2   2
2   10           3      4   8
3   20           1      5   5
4   20           2      1   1
5   20           3      9   2

I'd like to group the data by timestamp and calculate the following statistic for each timestamp: np.sum(v1*v2). The value should be repeated on every row of its group, so I'd like to see the following result:

    timestamp   idx     v1  v2  stat
0   10           1      1   1   37
1   10           2      2   2   37
2   10           3      4   8   37
3   20           1      5   5   44
4   20           2      1   1   44
5   20           3      9   2   44

I'm trying to do the following:

def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)

But the stat column contains only NaN values. What is wrong with my code?

Upvotes: 2

Views: 43

Answers (1)

Henry Ecker

Reputation: 35686

We want groupby transform here, not groupby apply:

df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
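
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same one-liner using the data from the question. The intermediate name product is only there for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': [10, 10, 10, 20, 20, 20],
    'idx': [1, 2, 3, 1, 2, 3],
    'v1': [1, 2, 4, 5, 1, 9],
    'v2': [1, 2, 8, 5, 1, 2],
})

# Row-wise product first; the result keeps the original row labels 0..5.
product = df['v1'] * df['v2']

# Group the products by timestamp and repeat each group's sum
# on every row that belongs to that group.
df['stat'] = product.groupby(df['timestamp']).transform('sum')

print(df)
```

Because transform returns a Series with the same index as its input, the assignment lines up row by row without any join.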

If we really want to use the function, we need to join the aggregated result back so that each group's value is repeated on every row of the original DataFrame:

def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)


df = df.join(
    df.groupby('timestamp').apply(calc_some_stat)
        .rename('stat'),  # Needed to use join but also sets the col name
    on='timestamp'
)

df:

   timestamp  idx  v1  v2  stat
0         10    1   1   1    37
1         10    2   2   2    37
2         10    3   4   8    37
3         20    1   5   5    44
4         20    2   1   1    44
5         20    3   9   2    44

The issue is that groupby apply is producing summary information:

timestamp
10    37
20    44
dtype: int64

This does not assign back to the DataFrame naturally. Assignment aligns on index labels, and this Series is indexed by the timestamp values 10 and 20, none of which match the row labels 0 to 5 of the original DataFrame, so every row ends up NaN. We either need to use join to spread these 2 values across the 6 original rows, or we can avoid all of this using groupby transform, which is designed to produce a:

like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values
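
The alignment behaviour can be checked directly. The sketch below is only an illustration using the question's data: the names per_group, misaligned, and mapped are made up for this example, and the Series.map lookup at the end is an extra alternative to join, not something the code above relies on:

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': [10, 10, 10, 20, 20, 20],
    'idx': [1, 2, 3, 1, 2, 3],
    'v1': [1, 2, 4, 5, 1, 9],
    'v2': [1, 2, 8, 5, 1, 2],
})

# Aggregated result: one value per group, indexed by timestamp (10, 20).
per_group = (df['v1'] * df['v2']).groupby(df['timestamp']).sum()

# Assigning the Series aligns on index labels. The row labels 0..5
# do not appear in per_group's index, so every row becomes NaN.
misaligned = df.assign(stat=per_group)
print(misaligned['stat'].isna().all())

# Looking up each row's timestamp in the aggregated Series works,
# because the lookup key is now the timestamp value, not the row label.
mapped = df.assign(stat=df['timestamp'].map(per_group))
print(mapped)
```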

Upvotes: 3
