Bera
Bera

Reputation: 1949

Weighted average pandas

I have a dataframe with forest stand id, tree species, height and volume:

import pandas as pd

df=pd.DataFrame.from_items([('STAND_ID',[1,1,2,3,3,3]),('Species',['Conifer','Broadleaves','Conifer','Broadleaves','Conifer','Conifer']),
                             ('Height',[20,19,13,24,25,18]),('Volume',[200,100,300,50,100,10])])

   STAND_ID      Species  Height  Volume
0         1      Conifer      20     200
1         1  Broadleaves      19     100
2         2      Conifer      13     300
3         3  Broadleaves      24      50
4         3      Conifer      25     100
5         3      Conifer      18      10

I want to groupby and unstack by stand id and calculate volume weighted mean height so i try:

newdf=df.groupby(['STAND_ID','Species']).mean().unstack()

          Height              Volume        
Species  Broadleaves Conifer Broadleaves Conifer
STAND_ID                                        
1               19.0    20.0       100.0   200.0
2                NaN    13.0         NaN   300.0
3               24.0    21.5        50.0    55.0

The heights are of course not volume weighted means. How can i volume weight them? Like this for STAND_ID 3 and Conifer:

(25*100 + 18*10)/(100+10)=24.4

Upvotes: 4

Views: 13701

Answers (2)

Martin Alexandersson
Martin Alexandersson

Reputation: 1427

If lambda functions are confusing apply can also be used with a function definition. (And there is also a function numpy.average to calculate weighted mean)

import numpy as np
def weighted_average(group):
   weights = group['Volume']
   height = group['Height']   
   return np.average(height,weights=weights)

df.groupby(['STAND_ID','Species']).apply(func = weighted_average).unstack()

Upvotes: 13

Andrew L
Andrew L

Reputation: 7038

If I understand correctly, one way would be performing a groupby with apply:

df
   STAND_ID      Species  Height  Volume
0         1      Conifer      20     200
1         1  Broadleaves      19     100
2         2      Conifer      13     300
3         3  Broadleaves      24      50
4         3      Conifer      25     100
5         3      Conifer      18      10

df.groupby(['STAND_ID','Species']).apply(lambda x: (x['Height'] * x['Volume'].div(x['Volume'].sum())).sum()).unstack()

Species   Broadleaves    Conifer
STAND_ID                        
1                19.0  20.000000
2                 NaN  13.000000
3                24.0  24.363636

Upvotes: 6

Related Questions