Chubaka
Chubaka

Reputation: 3135

python pandas calculate averages column by column

In pandas 0.18.1, python 2.7.6:

Imagine we have the following table:

ID,FROM_YEAR,FROM_MONTH,YEARMONTH,AREA,AREA2
1,2015,1,201501,200,100
1,2015,2,201502,200,100
1,2015,3,201503,200,100
1,2015,4,201504,200,100
1,2015,5,201505,200,100
1,2015,6,201506,200,100
1,2015,7,201507,200,100
1,2015,8,201508,200,100
1,2015,9,201509,200,100
1,2015,10,201510,200,100
1,2015,11,201511,200,100
1,2015,12,201512,200,100
1,2016,1,201601,100,200
1,2016,2,201602,100,200
1,2016,3,201603,100,200
1,2016,4,201604,100,200
1,2016,5,201605,100,200
1,2016,6,201606,100,200
1,2016,7,201607,100,200
1,2016,8,201608,100,200
1,2016,9,201609,100,200
1,2016,10,201610,100,200
1,2016,11,201611,100,200
1,2016,12,201612,100,200

Is there any way that we can do the same thing as the following MySQL query in python pandas (the merge function can probably work, but is there any way to avoid an expensive merge/ join in python pandas)?

SELECT 
ID,
FROM_YEAR,
'A' AS TYPE,
AVG(AREA) AS AREA,
AVG(AREA2) AS AREA2
 FROM table GROUP BY ID,FROM_YEAR

UNION ALL 

SELECT 
ID,
FROM_YEAR,
'B' AS TYPE,
AVG(AREA) AS AREA,
AVG(AREA2) AS AREA2
 FROM table GROUP BY ID,FROM_YEAR;

The goal here is to get an calendar year average of AREA and AREA2 columns in the following format:

ID,FROM_YEAR,TYPE,AREA,AREA2
1,2015,A,200,100
1,2016,A,100,200
1,2015,B,200,100
1,2016,B,100,200

Could any guru enlighten?

================================= One extended question =================

Thanks for the answer! I just encounter another question in a trailing 12 case:

Desired output:

ID,FROM_YEAR,FROM_MONTH,YEARMONTH,AREA,AREA2
1,2015,1,201501,NULL,NULL
1,2015,2,201502,NULL,NULL
1,2015,3,201503,NULL,NULL
1,2015,4,201504,NULL,NULL
1,2015,5,201505,NULL,NULL
1,2015,6,201506,NULL,NULL
1,2015,7,201507,NULL,NULL
1,2015,8,201508,NULL,NULL
1,2015,9,201509,NULL,NULL
1,2015,10,201510,NULL,NULL
1,2015,11,201511,NULL,NULL
1,2015,12,201512,200,100

the following code

agg=df.groupby(['ID','FROM_YEAR'])[['AREA','AREA2']].rolling(window=12).mean()

will only generate this result, where FROM_MONTH and YEARMONTH are missing.

ID,FROM_YEAR,AREA,AREA2
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,200,100

Could anyone enlighten? Thanks!

Upvotes: 1

Views: 130

Answers (1)

akuiper
akuiper

Reputation: 214957

You can use pandas.concat here it only involves one aggregation and doesn't invoke the merge/join process:

agg = df.groupby(['ID', 'FROM_YEAR'], as_index=False)[["AREA", "AREA2"]].mean()

pd.concat([agg.assign(TYPE = t) for t in ["A", "B"]], ignore_index=True)

enter image description here

Upvotes: 3

Related Questions