JLP
JLP

Reputation: 55

pandas pivot table: calculate weighted averages through aggfunc

I've got a pandas dataframe on education and income that looks basically like this.

import pandas as pd
import numpy as np

data = {
    'education': ['Low', 'High', 'High', 'Medium', 'Low', 'Low', 'High', 'Low', 'Medium', 'Medium'],
    'income': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'weights': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
}
    
df = pd.DataFrame(data, columns=['education', 'income', 'weights'])

print(df)
  education  income  weights
0       Low       1       11
1      High       2       12
2      High       3       13
3    Medium       4       14
4       Low       5       15
5       Low       6       16
6      High       7       17
7       Low       8       18
8    Medium       9       19
9    Medium      10       20

I've created a pivot table that computes the mean income for each category of education, like so:

pivot_educ_inc = pd.pivot_table(df, 
                                values='income',
                                index='education',
                                aggfunc=np.mean)

print(pivot_educ_inc)
             income
education          
High       4.000000
Low        5.000000
Medium     7.666667

What I'd really like is to use my weights column to get the weighted means of income for each level of education. But I can't find a way to define a weighted means function that I can assign to aggfunc and that can do this.

It isn't really convenient (possible?) for me to simply create a weighted dataset because the weights add up to over 100 million. Also, ideally I'd like to use the aggfunc argument because I've got many more columns like education in my dataset that I'd like to compute weighted averages for, some of which have upwards of 25 categories.

I might be totally overlooking something here, but I'm stumped.

Upvotes: 3

Views: 4043

Answers (2)

Ralubrusto
Ralubrusto

Reputation: 1501

I'm a big fan of pivot_table, so here it goes a solution using it:

pivot = df.pivot_table(values='income',
                       index='education',
                       aggfunc=lambda rows: np.average(rows, weights=df.loc[rows.index, 'weights']))

The resulting dataframe will be as follows:

             income
education          
High       4.333333
Low        5.433333
Medium     8.056604

Upvotes: 5

Mekhi
Mekhi

Reputation: 46

I would add an additional column with the weighted sum. It will then be like this:

df = pd.DataFrame(data, columns=['education', 'income', 'weights'])
df['weighted'] = df['income'] * df['weights']


pivot_educ_inc = pd.pivot_table(df, 
                                values=['weights', 'weighted'],
                                index='education',
                                aggfunc=np.sum)

pivot_educ_inc['weighted_avg'] = pivot_educ_inc['weighted'] / pivot_educ_inc['weights']

Upvotes: 1

Related Questions