Reputation: 77
I have two DataFrames, which I am trying to make a single containing the mean of the two. Each has missing values.
If there were not missing values I could use (df1 + df2)/2.
I would like to take the mean of those that have both data points, while returning 'NaN' as the 'mean' for the points that are missing data.
For example:
df1 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [15,'NaN', 17, 18], 'Var2': [1.0, 1.5, 2.0, 1.5], 'Var3': [1, 0, 0, 1]})
df1 = df1.set_index('ID')
Var1 Var2 Var3
ID
Alpha 15 1.0 1
Bravo NaN 1.5 0
Charlie 17 2.0 0
Delta 18 1.5 1
_
df2 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [20, 15, 17, 20], 'Var2': [1.2,'NaN', 3.0, 1.0], 'Var3': [0, 0, 1, 1]})
df2=df2.set_index('ID')
Var1 Var2 Var3
ID
Alpha 20 1.2 0
Bravo 15 NaN 0
Charlie 17 3 1
Delta 20 1 1
_
The resulting DataFrame should be:
Var1 Var2 Var3
ID
Alpha 17.5 1.10 0.5
Bravo NaN NaN 0.0
Charlie 17.0 2.50 0.5
Delta 19.0 1.25 1.0
So, in simple terms, my question is how can I do (df1 + df2)/2 while ignoring the points that have NaN's?
Upvotes: 2
Views: 117
Reputation: 323396
Your columns
type
for var1
in both df1
and df2
is object , that is because of the original data is [1.2,'NaN', 3.0, 1.0]
which is mixed type also 'NaN' is not np.NaN
.
df1 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [15,np.NaN, 17, 18], 'Var2': [1.0, 1.5, 2.0, 1.5], 'Var3': [1, 0, 0, 1]})
df1 = df1.set_index('ID')
df2 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [20, 15, 17, 20], 'Var2': [1.2,np.NaN, 3.0, 1.0], 'Var3': [0, 0, 1, 1]})
df2=df2.set_index('ID')
df1.add(df2)/2
Out[109]:
Var1 Var2 Var3
ID
Alpha 17.5 1.10 0.5
Bravo NaN NaN 0.0
Charlie 17.0 2.50 0.5
Delta 19.0 1.25 1.0
Upvotes: 3
Reputation: 29740
You can do exactly that, (df1 + df2) / 2
.
Your real problem here is that the NaN-containing columns in your DataFrames are of object
dtype, not floating datatypes. Fix that, and the above method works. Ideally fix that by having np.nan
in your inputs, or alternatively force the column dtypes to numeric later.
df1 = df1.apply(pd.to_numeric, errors='coerce')
df2 = df2.apply(pd.to_numeric, errors='coerce')
Upvotes: 5
Reputation: 164843
You can use NumPy to calculate the mean of an array of arrays.
arr = np.array([df1.replace('NaN', np.nan).values,
df2.replace('NaN', np.nan).values])
res = pd.DataFrame(arr.mean(0), index=df1.index, columns=df1.columns)
print(res)
Var1 Var2 Var3
ID
Alpha 17.5 1.10 0.5
Bravo NaN NaN 0.0
Charlie 17.0 2.50 0.5
Delta 19.0 1.25 1.0
Explanation
This solution requires the following steps:
'NaN'
to np.nan
.axis=0
.Upvotes: 2