Reputation: 1170
I have a dataframe of historical data where not every row has a "complete" set of values. In the example below, data was only collected for parameter "C" in a single row. I'd like to compute the per-column mean of this data set, but only report a mean for columns where the total number of real (non-NaN) samples is > n.
import pandas as pd
import numpy as np

columns = ['A', 'B', 'C']
rows = range(5)
data = [
    [5, 2, np.nan],
    [2, 2, np.nan],
    [4, 4, 3],
    [1, 2, np.nan],
    [1, np.nan, np.nan],
]
history = pd.DataFrame(columns=columns, index=rows, data=data)

non_nan_values = history.notnull().sum()  # -> A: 5, B: 4, C: 1
means = history.mean()                    # -> A: 2.6, B: 2.5, C: 3.0
If I attempt history[history.notnull().sum() > 3].mean(), I get an IndexingError because the boolean Series isn't actually a valid row mask.
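For context, the comparison produces a boolean Series indexed by the column labels rather than the row labels, which appears to be why pandas rejects it as a row indexer:

history.notnull().sum() > 3
# A     True
# B     True
# C    False
# dtype: bool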
Aside from iterating over the means afterwards and replacing values with NaN based on my non_nan_values variable, is there a better solution to this? The target result would be a means Series that looks like this:
A: 2.6, B: 2.5, C: NaN
Upvotes: 0
Views: 29
Reputation: 215117
If all the columns are numeric, you can use the non-null count to mask the mean after it's calculated instead of before:
history.mean().where(history.notnull().sum() > 3)
#A 2.6
#B 2.5
#C NaN
#dtype: float64
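A variation that should give the same result, since DataFrame.count() also returns the per-column count of non-NA values:

history.mean().where(history.count() > 3)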
Converting the result to a dictionary is easy with to_dict:
history.mean().where(history.notnull().sum() > 3).to_dict()
# {'B': 2.5, 'C': nan, 'A': 2.6000000000000001}
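If you'd rather leave the columns that don't meet the threshold out of the dictionary entirely, dropping the NaN entries first should do it:

history.mean().where(history.notnull().sum() > 3).dropna().to_dict()
# {'A': 2.6, 'B': 2.5}  (float repr may vary)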
Upvotes: 1