Reputation: 55
I've got a pandas dataframe on education and income that looks basically like this.
import pandas as pd
import numpy as np
data = {
'education': ['Low', 'High', 'High', 'Medium', 'Low', 'Low', 'High', 'Low', 'Medium', 'Medium'],
'income': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'weights': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
}
df = pd.DataFrame(data, columns=['education', 'income', 'weights'])
print(df)
education income weights
0 Low 1 11
1 High 2 12
2 High 3 13
3 Medium 4 14
4 Low 5 15
5 Low 6 16
6 High 7 17
7 Low 8 18
8 Medium 9 19
9 Medium 10 20
I've created a pivot table that computes the mean income for each category of education, like so:
pivot_educ_inc = pd.pivot_table(df,
values='income',
index='education',
aggfunc=np.mean)
print(pivot_educ_inc)
income
education
High 4.000000
Low 5.000000
Medium 7.666667
What I'd really like is to use my weights column to get the weighted means of income for each level of education. But I can't find a way to define a weighted means function that I can assign to aggfunc and that can do this.
It isn't really convenient (possible?) for me to simply create a weighted dataset because the weights add up to over 100 million. Also, ideally I'd like to use the aggfunc argument because I've got many more columns like education in my dataset that I'd like to compute weighted averages for, some of which have upwards of 25 categories.
I might be totally overlooking something here, but I'm stumped.
Upvotes: 3
Views: 4043
Reputation: 1501
I'm a big fan of pivot_table
, so here it goes a solution using it:
pivot = df.pivot_table(values='income',
index='education',
aggfunc=lambda rows: np.average(rows, weights=df.loc[rows.index, 'weights']))
The resulting dataframe will be as follows:
income
education
High 4.333333
Low 5.433333
Medium 8.056604
Upvotes: 5
Reputation: 46
I would add an additional column with the weighted sum. It will then be like this:
df = pd.DataFrame(data, columns=['education', 'income', 'weights'])
df['weighted'] = df['income'] * df['weights']
pivot_educ_inc = pd.pivot_table(df,
values=['weights', 'weighted'],
index='education',
aggfunc=np.sum)
pivot_educ_inc['weighted_avg'] = pivot_educ_inc['weighted'] / pivot_educ_inc['weights']
Upvotes: 1