Issue in renaming the multiple aggregation outcome columns in pandas python

Question

I got a question regarding the multiple aggregation in pandas.

Originally I have a dataset which shows the oil price, and the detail is as follows:

And the head of the dataset is as follows:

What I want to do here is to get the mean and standard deviation for each quarter of the year 2014. And the ideal output is as follows:

In my script, I have already created the quarter info by doing so .

However, one thing that I do not understand here:

If I tried to use this command to do so

brent[brent.index.year == 2014].groupby('quarter').agg({"average_price": np.mean, "std_price": np.std})

I got an error as follows:

And if I use the following script, then it works

brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'), 
                                                   std_price=('Price','std'))

So the questions are:

What's wrong with the first approach here?
And why do we need to use the second approach here?

Thank you all for the help in advance!

jezrael · Accepted Answer

What's wrong with the first approach here?

There is passed dict, so pandas looking for columns from keys average_price, std_price and because not exist in DataFrame if return error.

Possible solution is specified column after groupby and pass list of tuples for specified new columns names with aggregate functions:

brent[brent.index.year == 2014].groupby('quarter')['Price'].agg([('average_price','mean'),('std_price',np.std)])

It is possible, because for one column Price is possible defined multiple columns names.

In later pandas versions are used named aggregations:

brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'), 
                                                      std_price=('Price',np.std))

Here is logic - for each aggregation is defined nw column name with aggregate column and aggregate function. So is possible aggregate multiple columns with different functions:

brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'), 
                                                      std_price=('Price',np.std),
                                                      sumQ=('quarter','sum'))

Notice, np.std has default ddof=0 and pandas std has ddof=1, so different outputs.

Issue in renaming the multiple aggregation outcome columns in pandas python

Answers (1)

Related Questions