Reputation: 752
I have found some answers about averaging dataframes, but none that includes the treatment of weights. I have figured a way to get to the result I want (see title) but I wonder if there is a more direct way of achieving the same goal.
EDIT: I need to average more than just two dataframes, however the example code below only includes two of them.
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
df2 = pd.DataFrame([[3, 1, np.nan, 1],
[2, 5, np.nan, 3],
[np.nan, 4, np.nan, 2],
[np.nan, 2, 1, 5]],
columns=list('ABCD'))
What I do is:
def fromDfToArraysStack(df):
for i in range(len(df)):
arrayRow = df.iloc[i].values
if i == 0:
arraysStack = arrayRow
else:
arraysStack = np.vstack((arraysStack, arrayRow))
return arraysStack
arraysStack1 = fromDfToArraysStack(df1)
arraysStack2 = fromDfToArraysStack(df2)
arrayOfArrays = np.array([arraysStack1, arraysStack2])
masked = np.ma.masked_array(arrayOfArrays,
np.isnan(arrayOfArrays))
arrayAve = np.ma.average(masked,
axis = 0,
weights = [1,2])
pd.DataFrame(np.row_stack(arrayAve.filled(np.nan)))
0 1 2 3
0 3.000000 1.333333 NaN 0.666667
1 2.333333 4.666667 NaN 2.333333
2 NaN 4.000000 NaN 3.000000
3 NaN 2.333333 1.0 4.666667
As I said this works, but hopefully there is a more concise way to do this, one-liner anybody ?
Upvotes: 3
Views: 366
Reputation: 786
Would this work for you? Its not a one liner but still a lot shorter :)
import pandas as pd
import numpy as np
df3 = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
df4 = pd.DataFrame([[3, 1, np.nan, 1],
[2, 5, np.nan, 3],
[np.nan, 4, np.nan, 2],
[np.nan, 2, 1, 5]],
columns=list('ABCD'))
weights = [1,2]
average = (df3*weights[0]+df4*weights[1])/sum(weights)
average[df3.isna()] = df4
average[df4.isna()] = df3
average
EDIT: Since pointed out that speed is of concern I provide optimised version below and some performance results. In the optimised version I convert dataframes to numpy arrays since it works faster there (as do you in your example):
import pandas as pd
import numpy as np
df3 = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
df4 = pd.DataFrame([[3, 1, np.nan, 1],
[2, 5, np.nan, 3],
[np.nan, 4, np.nan, 2],
[np.nan, 2, 1, 5]],
columns=list('ABCD'))
weights = np.array([1,2])
df3 = df3.values
df4 = df4.values
average = (df3*weights[0]+df4*weights[1])/np.sum(weights)
np.copyto(average,df4,where=np.isnan(df3))
np.copyto(average,df3,where=np.isnan(df4))
average
Timing results:
1.18 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
18.4 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Upvotes: 1
Reputation: 986
To make it a tidy one-line, I cheated a little with the imports, but here is the best I could do:
import pandas as pd
import numpy as np
from numpy.ma import average as avg
from numpy.ma import masked_array as ma
df1 = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
df2 = pd.DataFrame([[3, 1, np.nan, 1],
[2, 5, np.nan, 3],
[np.nan, 4, np.nan, 2],
[np.nan, 2, 1, 5]],
columns=list('ABCD'))
df1.combine(df2, lambda x, y: avg([ma(x, np.isnan(x)), ma(y, np.isnan(y))], 0, [1, 2]))
EDIT:
import pandas as pd
import numpy as np
from numpy.ma import average as avg
from numpy.ma import masked_array as ma
df1 = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
df2 = pd.DataFrame([[3, 1, np.nan, 1],
[2, 5, np.nan, 3],
[np.nan, 4, np.nan, 2],
[np.nan, 2, 1, 5]],
columns=list('ABCD'))
def df_average(dfs, wgts):
return pd.DataFrame(avg([ma(df.values, np.isnan(df.values)) for df in dfs], 0, wgts))
df_average(dfs=[df1, df2], wgts=[1, 2])
Upvotes: 1