Reputation: 49
I'm working with a dataframe called Ozon
. It has a column called O3
and a DatetimeIndex. Now I created a new dataframe so that I would get the mean ozon values for each day:
dailymeanozon = ozon.groupby(pd.Grouper(freq='1D')).mean()
This works fine, however for some days, the amount of (correct) ozon measurements is really small. So for some days, the daily mean ozon value isn't representative for the entire day. Therefore I want the groupby
function to exclude days that have less than X amount of values. Or what may also work: to exclude the day if it has to many NaN values.
It seems like a simple question, but I'm quite new to python so I couldn't figure it out. So all help is very much appreciated!
Upvotes: 0
Views: 243
Reputation: 2060
There are several ways to solve this, but this might be one of the simplest: Count the number of measurements per day, and use that information to filter the daily mean dataframe. The .count
function from pandas
has the property that it only counts rows that are not None
, which is basically what you want. This should give you an impression:
import pandas
import random
# Some test data
ozon = pandas.DataFrame({
"date": pandas.date_range(start="1/1/2020", end="14/1/2020", freq="H")
})
ozon["O3"] = [random.randint(0, 10) for _ in range(ozon.shape[0])]
ozon.loc[ozon.sample(frac=0.5, replace=True, random_state=1).index.values, "O3"] = None
ozon = ozon.set_index("date")
# Specify how many non-Na values should be available
min_number_measurements = 14
# Compute count of group, excluding missing values
daily_counts = ozon.groupby(pandas.Grouper(freq="1D")).count()
daily_counts = daily_counts.loc[daily_counts["O3"] >= min_number_measurements]
# Your function
daily_mean_ozon = ozon.groupby(pandas.Grouper(freq="1D")).mean()
# Select the daily mean ozon lines for which the day had at least min_number_measurements
daily_mean_ozon = daily_mean_ozon.loc[daily_counts.loc[daily_counts["O3"] >= min_number_measurements].index]
Upvotes: 1