jhurst5
jhurst5

Reputation: 77

Get the mean of two DataFrames with missing values

I have two DataFrames, which I am trying to make a single containing the mean of the two. Each has missing values.

If there were not missing values I could use (df1 + df2)/2.

I would like to take the mean of those that have both data points, while returning 'NaN' as the 'mean' for the points that are missing data.

For example:

df1 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [15,'NaN', 17, 18], 'Var2': [1.0, 1.5, 2.0, 1.5], 'Var3': [1, 0, 0, 1]})   

df1 = df1.set_index('ID')

        Var1  Var2  Var3
ID                      
Alpha     15   1.0     1
Bravo    NaN   1.5     0
Charlie   17   2.0     0
Delta     18   1.5     1

_

df2 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [20, 15, 17, 20], 'Var2': [1.2,'NaN', 3.0, 1.0], 'Var3': [0, 0, 1, 1]})

df2=df2.set_index('ID')

         Var1 Var2  Var3
ID                      
Alpha      20  1.2     0
Bravo      15  NaN     0
Charlie    17    3     1
Delta      20    1     1

_

The resulting DataFrame should be:

         Var1   Var2  Var3
ID                      
Alpha      17.5  1.10  0.5
Bravo      NaN   NaN   0.0
Charlie    17.0  2.50  0.5
Delta      19.0  1.25  1.0

So, in simple terms, my question is how can I do (df1 + df2)/2 while ignoring the points that have NaN's?

Upvotes: 2

Views: 117

Answers (3)

BENY
BENY

Reputation: 323396

Your columns type for var1 in both df1 and df2 is object , that is because of the original data is [1.2,'NaN', 3.0, 1.0] which is mixed type also 'NaN' is not np.NaN .

df1 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [15,np.NaN, 17, 18], 'Var2': [1.0, 1.5, 2.0, 1.5], 'Var3': [1, 0, 0, 1]})   
df1 = df1.set_index('ID')
df2 = pd.DataFrame({'ID': ['Alpha', 'Bravo', 'Charlie', 'Delta'], 'Var1': [20, 15, 17, 20], 'Var2': [1.2,np.NaN, 3.0, 1.0], 'Var3': [0, 0, 1, 1]})
df2=df2.set_index('ID')
df1.add(df2)/2
Out[109]: 
         Var1  Var2  Var3
ID                       
Alpha    17.5  1.10   0.5
Bravo     NaN   NaN   0.0
Charlie  17.0  2.50   0.5
Delta    19.0  1.25   1.0

Upvotes: 3

miradulo
miradulo

Reputation: 29740

You can do exactly that, (df1 + df2) / 2.

Your real problem here is that the NaN-containing columns in your DataFrames are of object dtype, not floating datatypes. Fix that, and the above method works. Ideally fix that by having np.nan in your inputs, or alternatively force the column dtypes to numeric later.

df1 = df1.apply(pd.to_numeric, errors='coerce')
df2 = df2.apply(pd.to_numeric, errors='coerce')

Upvotes: 5

jpp
jpp

Reputation: 164843

You can use NumPy to calculate the mean of an array of arrays.

arr = np.array([df1.replace('NaN', np.nan).values,
                df2.replace('NaN', np.nan).values])

res = pd.DataFrame(arr.mean(0), index=df1.index, columns=df1.columns)

print(res)

         Var1  Var2  Var3
ID                       
Alpha    17.5  1.10   0.5
Bravo     NaN   NaN   0.0
Charlie  17.0  2.50   0.5
Delta    19.0  1.25   1.0 

Explanation

This solution requires the following steps:

  1. Convert strings 'NaN' to np.nan.
  2. Combine 2 arrays into an array of arrays.
  3. Compute mean over axis=0.
  4. Construct dataframe, feeding index and columns from inputs.

Upvotes: 2

Related Questions