Tijmen Stronks
Tijmen Stronks

Reputation: 49

Conditional groupby in python

I'm working with a dataframe called Ozon. It has a column called O3 and a DatetimeIndex. Now I created a new dataframe so that I would get the mean ozon values for each day:

dailymeanozon = ozon.groupby(pd.Grouper(freq='1D')).mean()

This works fine, however for some days, the amount of (correct) ozon measurements is really small. So for some days, the daily mean ozon value isn't representative for the entire day. Therefore I want the groupby function to exclude days that have less than X amount of values. Or what may also work: to exclude the day if it has to many NaN values.

It seems like a simple question, but I'm quite new to python so I couldn't figure it out. So all help is very much appreciated!

Upvotes: 0

Views: 243

Answers (1)

Gijs Wobben
Gijs Wobben

Reputation: 2060

There are several ways to solve this, but this might be one of the simplest: Count the number of measurements per day, and use that information to filter the daily mean dataframe. The .count function from pandas has the property that it only counts rows that are not None, which is basically what you want. This should give you an impression:

import pandas
import random

# Some test data
ozon = pandas.DataFrame({
    "date": pandas.date_range(start="1/1/2020", end="14/1/2020", freq="H")
})
ozon["O3"] = [random.randint(0, 10) for _ in range(ozon.shape[0])]
ozon.loc[ozon.sample(frac=0.5, replace=True, random_state=1).index.values, "O3"] = None
ozon = ozon.set_index("date")

# Specify how many non-Na values should be available
min_number_measurements = 14

# Compute count of group, excluding missing values
daily_counts = ozon.groupby(pandas.Grouper(freq="1D")).count()
daily_counts = daily_counts.loc[daily_counts["O3"] >= min_number_measurements]

# Your function
daily_mean_ozon = ozon.groupby(pandas.Grouper(freq="1D")).mean()

# Select the daily mean ozon lines for which the day had at least min_number_measurements
daily_mean_ozon = daily_mean_ozon.loc[daily_counts.loc[daily_counts["O3"] >= min_number_measurements].index]

Upvotes: 1

Related Questions