Reputation: 1201
Is there a way to write an aggregation function, as used in the DataFrame.agg method, that has access to more than one column of the data being aggregated? Typical use cases would be weighted average and weighted standard deviation functions.
I would like to be able to write something like
def wAvg(c, w):
    return ((c * w).sum() / w.sum())

df = DataFrame(....)  # df has columns c and w, I want the weighted average
                      # of c using w as the weight
df.aggregate({"c": wAvg})  # and somehow tell it to use the w column as weights ...
Upvotes: 110
Views: 68516
Reputation: 2822
Adding to Wes McKinney's answer, this will rename the aggregated column:
grouped = df.groupby(keys)

def wavg(group):
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()

grouped.apply(wavg).reset_index().rename(columns={0: "wavg"})
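For a quick sanity check, here is a minimal sketch of what the renamed result looks like, assuming keys is a single column called group (the sample data is made up, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'group':   ['a', 'a', 'b', 'b'],
                   'data':    [5.0, 10.0, 100.0, 30.0],
                   'weights': [0.2, 0.8, 0.4, 0.6]})

keys = 'group'
grouped = df.groupby(keys)
# The unnamed result column comes back as 0 after reset_index, hence the rename.
print(grouped.apply(wavg).reset_index().rename(columns={0: "wavg"}))
#   group  wavg
# 0     a   9.0
# 1     b  58.0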
Upvotes: 0
Reputation: 17824
You can implement this function in the following way:
(df['c'] * df['w']).groupby(df['groups']).sum() / df.groupby('groups')['w'].sum()
For example:
df = pd.DataFrame({'groups': [1, 1, 2, 2], 'c': [3, 3, 4, 4], 'w': [5, 5, 6, 6]})
(df['c'] * df['w']).groupby(df['groups']).sum() / df.groupby('groups')['w'].sum()
Result:
groups
1    3.0
2    4.0
dtype: float64
Upvotes: 1
Reputation: 631
Here's a solution that doesn't require defining a named function in advance:
df.groupby('group').apply(lambda x: pd.Series({
    'weighted_average': np.average(x.data, weights=x.weights)}))
You can also use the same code to perform multiple aggregations:
df.groupby('group').apply(lambda x: pd.Series({
    'weighted_average': np.average(x.data, weights=x.weights),
    'regular_average': np.average(x.data)}))
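Neither snippet shows the input frame, so here is a minimal, self-contained sketch of the multi-aggregation call; the column names group, data, and weights are assumptions taken from the lambda:

import numpy as np
import pandas as pd

df = pd.DataFrame({'group':   ['a', 'a', 'b', 'b'],
                   'data':    [5, 10, 100, 30],
                   'weights': [0.2, 0.8, 0.4, 0.6]})

# Each group becomes one row; the Series keys become the output columns.
result = df.groupby('group').apply(lambda x: pd.Series({
    'weighted_average': np.average(x.data, weights=x.weights),
    'regular_average': np.average(x.data)}))
print(result)
#        weighted_average  regular_average
# group
# a                   9.0              7.5
# b                  58.0             65.0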
Upvotes: 12
Reputation: 2967
My solution is similar to Nathaniel's, only it's for a single column and I don't deep-copy the entire data frame each time, which could be prohibitively slow. The performance gain over the groupby(...).apply(...) solution is about 100x(!)
def weighted_average(df, data_col, weight_col, by_col):
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result
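A usage sketch; the sample frame and column names are assumptions, not part of the original answer:

import pandas as pd

df = pd.DataFrame({'group':   ['a', 'a', 'b', 'b'],
                   'data':    [5.0, 10.0, 100.0, 30.0],
                   'weights': [0.2, 0.8, 0.4, 0.6]})

# Weighted average of 'data' using 'weights', grouped by 'group'.
print(weighted_average(df, 'data', 'weights', 'group'))
# group
# a     9.0
# b    58.0
# dtype: float64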
Upvotes: 10
Reputation: 61967
It is possible to return any number of aggregated values from a groupby object with apply. Simply return a Series, and the index values will become the new column names.
Let's see a quick example:
df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'd1': [5, 10, 100, 30],
                   'd2': [7, 1, 3, 20],
                   'weights': [.2, .8, .4, .6]},
                  columns=['group', 'd1', 'd2', 'weights'])
df
  group   d1  d2  weights
0     a    5   7      0.2
1     a   10   1      0.8
2     b  100   3      0.4
3     b   30  20      0.6
Define a custom function that will be passed to apply. It implicitly accepts a DataFrame, meaning the data parameter is a DataFrame. Notice how it uses multiple columns, which is not possible with the agg groupby method:
def weighted_average(data):
    d = {}
    d['d1_wa'] = np.average(data['d1'], weights=data['weights'])
    d['d2_wa'] = np.average(data['d2'], weights=data['weights'])
    return pd.Series(d)
Call the groupby apply method with our custom function:
df.groupby('group').apply(weighted_average)
       d1_wa  d2_wa
group
a        9.0    2.2
b       58.0   13.2
You can get better performance by precalculating the weighted totals into new DataFrame columns, as explained in other answers, and avoiding apply altogether.
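For reference, a minimal sketch of that apply-free variant, continuing with the df defined above (the temporary column names are just illustrative):

# Precompute the weighted values once, then use plain groupby sums.
tmp = df.assign(_w_d1=df['d1'] * df['weights'],
                _w_d2=df['d2'] * df['weights'])
sums = tmp.groupby('group')[['_w_d1', '_w_d2', 'weights']].sum()
result = pd.DataFrame({'d1_wa': sums['_w_d1'] / sums['weights'],
                       'd2_wa': sums['_w_d2'] / sums['weights']})
print(result)
#        d1_wa  d2_wa
# group
# a        9.0    2.2
# b       58.0   13.2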
Upvotes: 16
Reputation: 105531
Yes; use the .apply(...) function, which will be called on each sub-DataFrame. For example:
grouped = df.groupby(keys)
def wavg(group):
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()
grouped.apply(wavg)
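For instance, with a toy frame (the column names data and weights match the function above; the grouping key category is an assumption):

import pandas as pd

df = pd.DataFrame({'category': ['x', 'x', 'y', 'y'],
                   'data':     [1.0, 3.0, 5.0, 7.0],
                   'weights':  [1.0, 3.0, 1.0, 1.0]})

grouped = df.groupby('category')
print(grouped.apply(wavg))
# category
# x    2.5
# y    6.0
# dtype: float64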
Upvotes: 130
Reputation: 51
Accomplishing this via groupby(...).apply(...) is non-performant. Here's a solution that I use all the time (essentially using kalu's logic).
def grouped_weighted_average(self, values, weights, *groupby_args, **groupby_kwargs):
    """
    :param values: column(s) to take the average of
    :param weights: column to weight on
    :param groupby_args: args to pass into groupby (e.g. the level you want to group on)
    :param groupby_kwargs: kwargs to pass into groupby
    :return: pandas.Series or pandas.DataFrame
    """
    if isinstance(values, str):
        values = [values]
    ss = []
    for value_col in values:
        df = self.copy()
        prod_name = 'prod_{v}_{w}'.format(v=value_col, w=weights)
        weights_name = 'weights_{w}'.format(w=weights)
        df[prod_name] = df[value_col] * df[weights]
        df[weights_name] = df[weights].where(~df[prod_name].isnull())
        df = df.groupby(*groupby_args, **groupby_kwargs).sum()
        s = df[prod_name] / df[weights_name]
        s.name = value_col
        ss.append(s)
    df = pd.concat(ss, axis=1) if len(ss) > 1 else ss[0]
    return df

pd.DataFrame.grouped_weighted_average = grouped_weighted_average
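A usage sketch of the monkey-patched method; the sample frame and column names are assumptions:

import pandas as pd

df = pd.DataFrame({'group':  ['a', 'a', 'b', 'b'],
                   'value':  [5.0, 10.0, 100.0, 30.0],
                   'weight': [0.2, 0.8, 0.4, 0.6]})

# After the monkey-patch above, the method is available on any DataFrame.
print(df.grouped_weighted_average('value', 'weight', 'group'))
# group
# a     9.0
# b    58.0
# Name: value, dtype: float64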
Upvotes: 4
Reputation: 4605
I do this a lot and found the following quite handy:
def weighed_average(grp):
    return grp._get_numeric_data().multiply(grp['COUNT'], axis=0).sum() / grp['COUNT'].sum()

df.groupby('SOME_COL').apply(weighed_average)
This will compute the weighted average of all the numerical columns in the df and drop non-numeric ones.
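A minimal usage sketch (the numeric column x is an assumption; note that _get_numeric_data is a private pandas method, so it may warn or change in newer versions):

import pandas as pd

df = pd.DataFrame({'SOME_COL': ['a', 'a', 'b', 'b'],
                   'x':        [5.0, 10.0, 100.0, 30.0],
                   'COUNT':    [1, 4, 2, 3]})

# Note: COUNT itself is numeric, so its (self-)weighted average appears too.
print(df.groupby('SOME_COL').apply(weighed_average))
#              x  COUNT
# SOME_COL
# a          9.0    3.4
# b         58.0    2.6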
Upvotes: 5
Reputation: 845
The following (based on Wes McKinney's answer) accomplishes exactly what I was looking for. I'd be happy to learn if there's a simpler way of doing this within pandas.
def wavg_func(datacol, weightscol):
    def wavg(group):
        dd = group[datacol]
        ww = group[weightscol] * 1.0
        return (dd * ww).sum() / ww.sum()
    return wavg


def df_wavg(df, groupbycol, weightscol):
    grouped = df.groupby(groupbycol)
    df_ret = grouped.agg({weightscol: sum})
    datacols = [cc for cc in df.columns if cc not in [groupbycol, weightscol]]
    for dcol in datacols:
        try:
            wavg_f = wavg_func(dcol, weightscol)
            df_ret[dcol] = grouped.apply(wavg_f)
        except TypeError:  # handle non-numeric columns
            df_ret[dcol] = grouped.agg({dcol: min})
    return df_ret
The function df_wavg() returns a DataFrame grouped by the "groupby" column, with the sum of the weights in the weights column. The other columns hold the weighted averages or, for non-numeric columns, the result of the min() function used for aggregation.
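A usage sketch; the sample frame (including a non-numeric label column to exercise the min() fallback) is an assumption:

import pandas as pd

df = pd.DataFrame({'city':   ['NY', 'NY', 'SF', 'SF'],
                   'label':  ['u', 'v', 'w', 'x'],
                   'price':  [5.0, 10.0, 100.0, 30.0],
                   'weight': [0.2, 0.8, 0.4, 0.6]})

# 'weight' holds the per-group weight sums; 'label' falls back to min().
print(df_wavg(df, 'city', 'weight'))
#       weight label  price
# city
# NY       1.0     u    9.0
# SF       1.0     w   58.0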
Upvotes: 4