Reputation: 4030
I have a situation like Pandas Group Weighted Average of Multiple Columns but where some values of one column are sometimes NaN.
That is, I am doing the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'var1': np.random.randint(0, 100, 4),
                   'var2': np.random.randint(0, 100, 4),
                   'weights': np.random.randint(0, 10, 4)})
df.loc[1, 'var1'] = np.nan
df
category var1 var2 weights
0 a 74.0 99 9
1 a NaN 8 4
2 b 13.0 86 2
3 b 49.0 38 7
def weighted(x, cols, w="weights"):
    # The following fails when NaNs are present:
    # return pd.Series(np.average(x[cols], weights=x[w], axis=0), cols)
    return pd.Series([np.nan if x.dropna(subset=[c]).empty
                      else np.average(x.dropna(subset=[c])[c],
                                      weights=x.dropna(subset=[c])[w])
                      for c in cols], cols)
df.groupby('category').apply(weighted, ['var1', 'var2'])
var1 var2
category
a 74.0 57.846154
b 23.0 8.000000
I'd like a nicer way to do this, but np.nanmean does not accept weights, and np.average offers no option to control the treatment of NaNs.
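(For comparison, the closest built-in route I know of is NumPy's masked-array machinery: np.ma.average excludes both a masked entry and its weight, though it returns a masked array rather than a Series. A minimal sketch with the sample numbers above:)

```python
import numpy as np

# Mask the NaNs so np.ma.average drops both the entry and its weight.
vals = np.array([[74.0, 99.0],
                 [np.nan, 8.0]])
weights = np.array([9.0, 4.0])

masked = np.ma.masked_invalid(vals)
avg = np.ma.average(masked, weights=weights, axis=0)
# avg -> [74.0, 71.0], since column 1 is (99*9 + 8*4) / (9 + 4)
```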
Upvotes: 4
Views: 467
Reputation: 4030
Since no cleaner answer than my own proposal has appeared, I suggest that using the function below is not so bad:
import pandas as pd
import numpy as np
def weighted_means_by_column_ignoring_NaNs(x, cols, w="weights"):
    """Take a DataFrame and average each data column (cols),
    weighting observations by column w, but ignoring individual NaN
    observations within each column.
    """
    return pd.Series([np.nan if x.dropna(subset=[c]).empty
                      else np.average(x.dropna(subset=[c])[c],
                                      weights=x.dropna(subset=[c])[w])
                      for c in cols], cols)
An example usage is as follows:
df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'var1': np.random.randint(0, 100, 4),
                   'var2': np.random.randint(0, 100, 4),
                   'weights': np.random.randint(0, 10, 4)})
df.loc[1, 'var1'] = np.nan
df
category var1 var2 weights
0 a 74.0 99 9
1 a NaN 8 4
2 b 13.0 86 2
3 b 49.0 38 7
df.groupby('category').apply(weighted_means_by_column_ignoring_NaNs,
                             ['var1', 'var2'])
var1 var2
category
a 74.0 57.846154
b 23.0 8.000000
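(A more vectorized variant of the same idea, offered as a sketch rather than a drop-in replacement: sum() skips NaNs by default, so NaN observations drop out of the numerator, and notna() zeroes their weight in the denominator. The function name weighted_mean_skipna is mine.)

```python
import pandas as pd
import numpy as np

def weighted_mean_skipna(x, cols, w="weights"):
    # Numerator: value * weight, with NaN products skipped by sum().
    num = x[cols].mul(x[w], axis=0).sum()
    # Denominator: weights counted only where the value is present.
    den = x[cols].notna().mul(x[w], axis=0).sum()
    return num / den

df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'var1': [74.0, np.nan, 13.0, 49.0],
                   'var2': [99, 8, 86, 38],
                   'weights': [9, 4, 2, 7]})
out = df.groupby('category').apply(weighted_mean_skipna, ['var1', 'var2'])
# out.loc['a', 'var1'] -> 74.0, out.loc['a', 'var2'] -> 71.0
```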
Upvotes: 3
Reputation: 25239
You may pre-process your dataframe with melt and dropna before calling apply, and then call unstack:
wa = lambda x: np.average(x.value, weights=x.weights)
df_avg = (df.melt(['category', 'weights']).dropna()
            .groupby(['category', 'variable'])
            .apply(wa).unstack())
Out[40]:
variable var1 var2
category
a 74.0 71.000000
b 41.0 48.666667
Note: your desired output doesn't match the sample data. The value of (a, 'var2')
is (99 * 9 + 8 * 4) / (9 + 4) = 71.
Upvotes: 0
Reputation: 475
How about setting the NaN values to zero and creating a new column which is var * weight?
Then you can use groupby
to get the result.
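(A sketch of that suggestion, with one caveat the answer leaves implicit: the weight of a NaN observation must also be excluded from the denominator, otherwise the average is biased toward zero. The helper column names '_vw' and '_w' are mine.)

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'var1': [74.0, np.nan, 13.0, 49.0],
                   'var2': [99, 8, 86, 38],
                   'weights': [9, 4, 2, 7]})

cols = ['var1', 'var2']
for c in cols:
    # NaN -> 0 in the numerator, and zero out its weight as well.
    df[c + '_vw'] = df[c].fillna(0) * df['weights']
    df[c + '_w'] = df['weights'] * df[c].notna()

g = df.groupby('category')
result = pd.DataFrame({c: g[c + '_vw'].sum() / g[c + '_w'].sum() for c in cols})
# result.loc['a', 'var1'] -> 74.0, result.loc['a', 'var2'] -> 71.0
```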
Upvotes: 0