Reputation: 1170
I have a dataframe of historical data where not every row has a "complete" set of values. In the example below, data was only collected for parameter "C" in a single row. I'd like to compute the per-column mean of this data set, but only report a mean for columns where the total number of real (non-NaN) samples is > n.
import pandas as pd
import numpy as np

columns = ['A', 'B', 'C']
rows = range(5)
data = [
    [5, 2, np.nan],
    [2, 2, np.nan],
    [4, 4, 3],
    [1, 2, np.nan],
    [1, np.nan, np.nan],
]
history = pd.DataFrame(columns=columns, index=rows, data=data)

non_nan_values = history.notnull().sum()  # -> A: 5, B: 4, C: 1
means = history.mean()                    # -> A: 2.6, B: 2.5, C: 3.0
If I attempt history[history.notnull().sum() > 3].mean(), I get an IndexingError because the boolean Series isn't actually a valid row mask.
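For context, the comparison produces a boolean Series indexed by the column labels rather than the row labels, which appears to be why pandas rejects it as a row indexer:

history.notnull().sum() > 3
# A     True
# B     True
# C    False
# dtype: bool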
Aside from iterating over the means afterwards and replacing values with NaN based on my non_nan_values variable, is there a better solution to this? The target result would be a means Series that looks like this:
A: 2.6, B: 2.5, C: NaN
Upvotes: 0
Views: 29
Reputation: 215117
If all the columns are numeric, you can use the non-null count to mask the mean after it's calculated instead of before:
history.mean().where(history.notnull().sum() > 3)
#A 2.6
#B 2.5
#C NaN
#dtype: float64
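A variation that should give the same result, since DataFrame.count() also returns the per-column count of non-NA values:

history.mean().where(history.count() > 3)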
Converting the result to a dictionary is easy with to_dict:
history.mean().where(history.notnull().sum() > 3).to_dict()
# {'B': 2.5, 'C': nan, 'A': 2.6000000000000001}
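If you'd rather leave the columns that don't meet the threshold out of the dictionary entirely, dropping the NaN entries first should do it:

history.mean().where(history.notnull().sum() > 3).dropna().to_dict()
# {'A': 2.6, 'B': 2.5}  (float repr may vary)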
Upvotes: 1