Reputation: 3135
In pandas 0.18.1, python 2.7.6:
Imagine we have the following table:
ID,FROM_YEAR,FROM_MONTH,YEARMONTH,AREA,AREA2
1,2015,1,201501,200,100
1,2015,2,201502,200,100
1,2015,3,201503,200,100
1,2015,4,201504,200,100
1,2015,5,201505,200,100
1,2015,6,201506,200,100
1,2015,7,201507,200,100
1,2015,8,201508,200,100
1,2015,9,201509,200,100
1,2015,10,201510,200,100
1,2015,11,201511,200,100
1,2015,12,201512,200,100
1,2016,1,201601,100,200
1,2016,2,201602,100,200
1,2016,3,201603,100,200
1,2016,4,201604,100,200
1,2016,5,201605,100,200
1,2016,6,201606,100,200
1,2016,7,201607,100,200
1,2016,8,201608,100,200
1,2016,9,201609,100,200
1,2016,10,201610,100,200
1,2016,11,201611,100,200
1,2016,12,201612,100,200
Is there any way that we can do the same thing as the following MySQL query in python pandas (the merge function can probably work, but is there any way to avoid an expensive merge/ join in python pandas)?
SELECT
ID,
FROM_YEAR,
'A' AS TYPE,
AVG(AREA) AS AREA,
AVG(AREA2) AS AREA2
FROM table GROUP BY ID,FROM_YEAR
UNION ALL
SELECT
ID,
FROM_YEAR,
'B' AS TYPE,
AVG(AREA) AS AREA,
AVG(AREA2) AS AREA2
FROM table GROUP BY ID,FROM_YEAR;
The goal here is to get an calendar year average of AREA and AREA2 columns in the following format:
ID,FROM_YEAR,TYPE,AREA,AREA2
1,2015,A,200,100
1,2016,A,100,200
1,2015,B,200,100
1,2016,B,100,200
Could any guru enlighten?
================================= One extended question =================
Thanks for the answer! I just encounter another question in a trailing 12 case:
Desired output:
ID,FROM_YEAR,FROM_MONTH,YEARMONTH,AREA,AREA2
1,2015,1,201501,NULL,NULL
1,2015,2,201502,NULL,NULL
1,2015,3,201503,NULL,NULL
1,2015,4,201504,NULL,NULL
1,2015,5,201505,NULL,NULL
1,2015,6,201506,NULL,NULL
1,2015,7,201507,NULL,NULL
1,2015,8,201508,NULL,NULL
1,2015,9,201509,NULL,NULL
1,2015,10,201510,NULL,NULL
1,2015,11,201511,NULL,NULL
1,2015,12,201512,200,100
the following code
agg=df.groupby(['ID','FROM_YEAR'])[['AREA','AREA2']].rolling(window=12).mean()
will only generate this result, where FROM_MONTH and YEARMONTH are missing.
ID,FROM_YEAR,AREA,AREA2
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,NULL,NULL
1,2015,200,100
Could anyone enlighten? Thanks!
Upvotes: 1
Views: 130
Reputation: 214957
You can use pandas.concat
here it only involves one aggregation and doesn't invoke the merge/join
process:
agg = df.groupby(['ID', 'FROM_YEAR'], as_index=False)[["AREA", "AREA2"]].mean()
pd.concat([agg.assign(TYPE = t) for t in ["A", "B"]], ignore_index=True)
Upvotes: 3