kkhatri99

Reputation: 926

Groupby Pandas DataFrame and calculate mean and stdev of one column

I have a Pandas DataFrame as below:

   a      b      c      d
0  Apple  3      5      7
1  Banana 4      4      8
2  Cherry 7      1      3
3  Apple  3      4      7

I would like to group the rows by column 'a', replace the values in column 'c' with the mean of each group, and add another column 'e' holding the standard deviation of the 'c' values that went into that mean. The values in columns 'b' and 'd' are constant within each group. The desired output would be:

   a      b      c      d      e
0  Apple  3      4.5    7      0.707107
1  Banana 4      4      8      0
2  Cherry 7      1      3      0

What is the best way to achieve this?

Upvotes: 45

Views: 72479

Answers (2)

cottontail

Reputation: 23449

If the values in some columns are constant for all rows being grouped (e.g. 'b' and 'd' in the OP), you can include those columns in the grouper and reorder the columns afterwards.

new_df = (
    df.groupby(['a', 'b', 'd'])['c'].agg(['mean', 'std'])   # groupby operation
    .set_axis(['c', 'e'], axis=1)                           # rename columns
    .reset_index()                                          # make groupers into columns
    [['a', 'b', 'c', 'd', 'e']]                             # reorder columns
)
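
For a reproducible check, here is a minimal sketch that rebuilds the frame from the question and runs the chain above. The construction of `df` is an assumption based on the table in the question; the column values are copied from it.

```python
import pandas as pd

# Sample frame reconstructed from the table in the question (assumed dtypes: ints)
df = pd.DataFrame({
    "a": ["Apple", "Banana", "Cherry", "Apple"],
    "b": [3, 4, 7, 3],
    "c": [5, 4, 1, 4],
    "d": [7, 8, 3, 7],
})

new_df = (
    df.groupby(["a", "b", "d"])["c"].agg(["mean", "std"])  # mean and sample std of 'c'
    .set_axis(["c", "e"], axis=1)                          # rename to 'c' and 'e'
    .reset_index()                                         # groupers back to columns
    [["a", "b", "c", "d", "e"]]                            # reorder columns
)
print(new_df)
```

Note that single-row groups (Banana, Cherry) get NaN in 'e' here, because the default sample standard deviation is undefined for one observation.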

You can also give the aggregated columns custom names directly in groupby.agg by passing (name, function) pairs. Here the mean column is named 'c' and the std column is named 'e', so no separate rename step is needed.

new_df = (
    df.groupby(['a', 'b', 'd'])['c'].agg([('c', 'mean'), ('e', 'std')])
    .reset_index()                                          # make groupers into columns
    [['a', 'b', 'c', 'd', 'e']]                             # reorder columns
)


You can also pass arguments to an aggregation function inside groupby.agg. For example, to pass ddof=0 to std(), wrap the call in a lambda.

new_df = (
    df.groupby(['a', 'b', 'd'])['c'].agg([('c', 'mean'), ('e', lambda g: g.std(ddof=0))])
    .reset_index()[['a', 'b', 'c', 'd', 'e']]
)
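
To see what ddof changes, here is a small sketch on the 'c' values of the Apple group (5 and 4) and on a one-row group. The variable names are illustrative only.

```python
import pandas as pd

apple_c = pd.Series([5, 4])   # 'c' values in the Apple group
single_c = pd.Series([4])     # a one-row group, e.g. Banana

sample_std = apple_c.std()              # ddof=1 (pandas default): divides by n - 1
population_std = apple_c.std(ddof=0)    # ddof=0: divides by n

single_sample = single_c.std()          # undefined for one observation
single_population = single_c.std(ddof=0)

print(sample_std, population_std, single_sample, single_population)
```

With ddof=1 the one-row group has no defined standard deviation, while ddof=0 gives 0 for it.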

Upvotes: 4

unutbu

Reputation: 880877

You could use a groupby-agg operation:

In [38]: result = df.groupby(['a'], as_index=False).agg(
                      {'c':['mean','std'],'b':'first', 'd':'first'})

and then rename and reorder the columns:

In [39]: result.columns = ['a','c','e','b','d']

In [40]: result.reindex(columns=sorted(result.columns))
Out[40]: 
        a  b    c  d         e
0   Apple  3  4.5  7  0.707107
1  Banana  4  4.0  8       NaN
2  Cherry  7  1.0  3       NaN
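
The question's desired output shows 0 rather than NaN for the one-row groups. If treating an undefined sample std as 0 is acceptable for your data (an assumption, not something the sample std itself implies), one option is to fill the missing values. A minimal sketch on the per-group std alone:

```python
import pandas as pd

# Sample frame reconstructed from the table in the question
df = pd.DataFrame({
    "a": ["Apple", "Banana", "Cherry", "Apple"],
    "b": [3, 4, 7, 3],
    "c": [5, 4, 1, 4],
    "d": [7, 8, 3, 7],
})

e = df.groupby("a")["c"].std()   # sample std per group; NaN for one-row groups
e_filled = e.fillna(0)           # replace the undefined values with 0
print(e_filled)
```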

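Pandas computes the sample standard deviation (ddof=1) by default, which is why the single-row groups above show NaN. To compute the population standard deviation (ddof=0) instead: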

def pop_std(x):
    return x.std(ddof=0)

result = df.groupby(['a'], as_index=False).agg({'c':['mean',pop_std],'b':'first', 'd':'first'})

result.columns = ['a','c','e','b','d']
result.reindex(columns=sorted(result.columns))

yields

        a  b    c  d    e
0   Apple  3  4.5  7  0.5
1  Banana  4  4.0  8  0.0
2  Cherry  7  1.0  3  0.0
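
As an alternative sketch, assuming pandas 0.25 or later, keyword-style named aggregation can produce the final column names and order in one step, so the rename and reindex are not needed. The frame construction is again an assumption based on the question's table.

```python
import pandas as pd

# Sample frame reconstructed from the table in the question
df = pd.DataFrame({
    "a": ["Apple", "Banana", "Cherry", "Apple"],
    "b": [3, 4, 7, 3],
    "c": [5, 4, 1, 4],
    "d": [7, 8, 3, 7],
})

# Each keyword is an output column: name=(input column, aggregation)
result = df.groupby("a", as_index=False).agg(
    b=("b", "first"),
    c=("c", "mean"),
    d=("d", "first"),
    e=("c", lambda x: x.std(ddof=0)),  # population std
)
print(result)
```

The output columns follow the order of the keywords, after the grouping column 'a'.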

Upvotes: 77
