Reputation: 6038
I have the following table. I want to calculate a weighted average, grouped by each date, based on the formula below. I can do this with some standard, conventional code, but assuming this data is in a pandas DataFrame, is there an easier way to achieve this than through iteration?
Date        ID   wt    value  w_avg
01/01/2012  100  0.50  60     0.791666667
01/01/2012  101  0.75  80
01/01/2012  102  1.00  100
01/02/2012  201  0.50  100    0.722222222
01/02/2012  202  1.00  80
01/01/2012: w_avg = 0.50 * (60 / sum(60, 80, 100)) + 0.75 * (80 / sum(60, 80, 100)) + 1.00 * (100 / sum(60, 80, 100))
01/02/2012: w_avg = 0.50 * (100 / sum(100, 80)) + 1.00 * (80 / sum(100, 80))
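In general, for each date the result is sum(wt_i * value_i) / sum(value_i), i.e. the average of wt weighted by value.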
Upvotes: 72
Views: 181799
Reputation: 2639
Another option is the following, which works for many columns at once, keeps track of the weights, and should be fast (I have not profiled it).
import pandas as pd

def _weighted_sum(df, weight_column: str = "weights"):
    """Computes a per-group weighted average of all non-weight columns;
    the weight column itself is aggregated as a plain sum."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    cols = [c for c in df.columns if c != weight_column]
    # scale each value column by the row weights, sum per index group,
    # then divide by each group's total weight
    df.loc[:, cols] = df.loc[:, cols].multiply(df[weight_column], axis=0)
    agg = df.groupby(df.index).sum()
    agg.loc[:, cols] = agg.loc[:, cols].divide(agg[weight_column], axis=0)
    return agg
def test_weighted_sum():
    df = pd.DataFrame(
        {
            "index": [1, 1, 2, 2],
            "x": [0, 1, 2, 100],
            "y": [4, 5, 6, 7],
            "weights": [1, 1, 1, 9],
        }
    )
    df.set_index("index", inplace=True)
    result = _weighted_sum(df)
    expected_result = pd.DataFrame(
        {
            "index": [1, 2],
            "x": [0.5, 90.2],
            "y": [4.5, 6.9],
            "weights": [2, 10],
        }
    )
    expected_result.set_index("index", inplace=True)
    pd.testing.assert_frame_equal(result, expected_result)
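A minimal usage sketch on the question's data (assuming Date is set as the index and value serves as the weight column):
df = pd.DataFrame(
    {
        "Date": ["01/01/2012", "01/01/2012", "01/01/2012", "01/02/2012", "01/02/2012"],
        "wt": [0.50, 0.75, 1.00, 0.50, 1.00],
        "value": [60, 80, 100, 100, 80],
    }
).set_index("Date")
_weighted_sum(df, weight_column="value")["wt"]
# Date
# 01/01/2012    0.791667
# 01/02/2012    0.722222
# Name: wt, dtype: float64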
Upvotes: 1
Reputation: 28644
Another option for this old question (using @kadee's sample data):
(df.drop(columns='ID')
   .assign(wt=lambda df: df.prod(axis=1))  # product of weight and value
   .groupby('Date').sum()
   .pipe(lambda df: df.wt / df.value)      # weighted average computation
)
Date
01/01/2012 0.791667
01/02/2012 0.722222
This should be faster if the computation is done without anonymous functions:
temp = df.drop(columns='ID')
temp = temp.assign(wt=temp.prod(axis=1)).groupby('Date').sum()
temp.wt / temp.value
Date
01/01/2012 0.791667
01/02/2012 0.722222
dtype: float64
Upvotes: 1
Reputation: 7671
If speed is an important factor for you, vectorizing is critical. Thus, building on the answer by Andy Hayden, here is a solution using only native pandas functions:
def weighted_mean(df, values, weights, groupby):
    grouped = df.groupby(groupby)
    df['weighted_average'] = df[values] / grouped[weights].transform('sum') * df[weights]
    return grouped['weighted_average'].sum(min_count=1)  # min_count is required for Grouper objects
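Applied to the question's sample data (assuming a df with the Date, wt and value columns), this should give:
weighted_mean(df, values='wt', weights='value', groupby='Date')
# Date
# 01/01/2012    0.791667
# 01/02/2012    0.722222
# Name: weighted_average, dtype: float64
# note: as written, this also adds a 'weighted_average' column to df as a side effect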
In comparison, using a custom lambda function is less code, but slower:
import numpy as np

def weighted_mean_by_lambda(df, values, weights, groupby):
    return df.groupby(groupby).apply(lambda x: np.average(x[values], weights=x[weights]))
Speed test:
import time
import numpy as np
import pandas as pd

n = 100000000
df = pd.DataFrame({
    'values': np.random.uniform(0, 1, size=n),
    'weights': np.random.randint(0, 5, size=n),
    'groupby': np.random.randint(0, 10000, size=n),
})

time1 = time.time()
weighted_mean(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean`:', time.time() - time1)

time2 = time.time()
weighted_mean_by_lambda(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean_by_lambda`:', time.time() - time2)
Speed test output:
Time for `weighted_mean`: 3.4519572257995605
Time for `weighted_mean_by_lambda`: 11.41335940361023
Upvotes: 12
Reputation: 1895
I saved the table in a .csv file:
import numpy as np
import pandas as pd

df = pd.read_csv('book1.csv')
grouped = df.groupby('Date')
g_wavg = lambda x: np.average(x.wt, weights=x.value)
grouped.apply(g_wavg)
Upvotes: 11
Reputation: 389
I feel the following is an elegant solution to this problem, from (Pandas DataFrame aggregate function using multiple columns):
grouped = df.groupby('Date')

def wavg(group):
    d = group['wt']     # values to average
    w = group['value']  # weights
    return (d * w).sum() / w.sum()

grouped.apply(wavg)
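For the question's sample data, this yields:
Date
01/01/2012    0.791667
01/02/2012    0.722222
dtype: float64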
Upvotes: 17
Reputation: 8864
Let's first create the example pandas DataFrame:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: index = pd.Index(['01/01/2012','01/01/2012','01/01/2012','01/02/2012','01/02/2012'], name='Date')
In [4]: df = pd.DataFrame({'ID':[100,101,102,201,202],'wt':[.5,.75,1,.5,1],'value':[60,80,100,100,80]},index=index)
Then, the average of 'wt' weighted by 'value' and grouped by the index is obtained as:
In [5]: df.groupby(df.index).apply(lambda x: np.average(x.wt, weights=x.value))
Out[5]:
Date
01/01/2012 0.791667
01/02/2012 0.722222
dtype: float64
Alternatively, one can also define a function:
In [5]: def grouped_weighted_avg(values, weights, by):
...: return (values * weights).groupby(by).sum() / weights.groupby(by).sum()
In [6]: grouped_weighted_avg(values=df.wt, weights=df.value, by=df.index)
Out[6]:
Date
01/01/2012 0.791667
01/02/2012 0.722222
dtype: float64
Upvotes: 68
Reputation: 375395
I think I would do this with two groupbys.
First, calculate each row's contribution to the weighted average:
In [11]: g = df.groupby('Date')
In [12]: df.value / g.value.transform("sum") * df.wt
Out[12]:
0 0.125000
1 0.250000
2 0.416667
3 0.277778
4 0.444444
dtype: float64
If you set this as a column, you can groupby over it:
In [13]: df['wa'] = df.value / g.value.transform("sum") * df.wt
Now the per-group sum of this column is the desired result:
In [14]: g.wa.sum()
Out[14]:
Date
01/01/2012 0.791667
01/02/2012 0.722222
Name: wa, dtype: float64
or, to broadcast the result back to each row:
In [15]: g.wa.transform("sum")
Out[15]:
0 0.791667
1 0.791667
2 0.791667
3 0.722222
4 0.722222
Name: wa, dtype: float64
Upvotes: 34