Pandas df.resample(): Specify NaN threshold for calculation of mean

Question

I want to resample a pandas DataFrame from hourly to annual/daily frequency, aggregating with the mean (how='mean'). However, some of the hourly data are missing during the year.

How can I set a threshold for the ratio of allowed NaNs, above which the mean is set to NaN too? I couldn't find anything about this in the docs.

Thanks in advance!

Romain · Accepted Answer

Here is a simple solution using groupby.

# Test data: one year of hourly values
import numpy as np
import pandas as pd

start_date = pd.to_datetime('2015-01-01')
number = 365 * 24
# 'h' replaces the deprecated 'H' alias in recent pandas versions
index = pd.date_range(start=start_date, periods=number, freq='h')
# Cast to float so the column can hold NaN
df = pd.DataFrame(np.random.randint(1, 10, number).astype(float), index=index, columns=['values'])
# Set the first 4 hours to NaN so the first day has only 20 valid values
na_range = pd.date_range(start=start_date, end=start_date + pd.Timedelta(hours=3), freq='h')
df.loc[na_range, 'values'] = np.nan

# grouping by day, computing the mean and the count
df = df.groupby(df.index.date).agg(['mean', 'count'])
df.columns = df.columns.droplevel()

# Populating the mean only if the number of values (count) is above the threshold
df['values'] = np.nan
df.loc[df['count'] > 20, 'values'] = df['mean']
print(df.head())

# Result (the means vary from run to run because the test data is random)
                mean  count  values
2015-01-01  4.947368     20     NaN
2015-01-02  5.125000     24   5.125
2015-01-03  4.875000     24   4.875
2015-01-04  5.750000     24   5.750
2015-01-05  4.875000     24   4.875
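
Since the question asks for a ratio of allowed NaNs rather than an absolute count, the same idea can also be written directly on top of resample(). This is only a minimal sketch, not part of the answer above: the helper name resample_mean_with_nan_limit and its max_nan_ratio parameter are made up for illustration. It assumes that missing hours are present in the index as NaN rows; hours that are absent from the index altogether are not counted as missing.

```python
import numpy as np
import pandas as pd


def resample_mean_with_nan_limit(series, rule, max_nan_ratio):
    """Mean per resampling bin, set to NaN where too many values are missing.

    Hypothetical helper for illustration; not a pandas API.
    """
    grouped = series.resample(rule)
    n_valid = grouped.count()  # non-NaN values per bin
    n_total = grouped.size()   # all rows per bin, NaN rows included
    nan_ratio = 1 - n_valid / n_total
    # Keep the mean only where the share of NaNs is within the limit
    return grouped.mean().where(nan_ratio <= max_nan_ratio)


# Three days of hourly data with deterministic values 0..71
index = pd.date_range('2015-01-01', periods=3 * 24, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.arange(3 * 24, dtype=float), index=index)
hourly.iloc[:4] = np.nan     # day 1: 4 of 24 values missing (about 17 %)
hourly.iloc[24:26] = np.nan  # day 2: 2 of 24 values missing (about 8 %)

# Allow at most 10 % NaNs per day
daily = resample_mean_with_nan_limit(hourly, 'D', max_nan_ratio=0.1)
print(daily)
```

With a limit of 10 %, the first day exceeds the allowed ratio and comes out as NaN, while the second and third days keep their means. Passing a yearly rule instead of 'D' applies the same check per year.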
