PEREZje

Reputation: 2502

How do I calculate the daily average across various classes of data in Pandas?

I have data of the following format:

   station_number                date  river_height  river_flow
0               1 2005-01-01 08:09:00      0.285233    0.782065
1               1 2005-01-01 11:28:12      0.129994    0.386652
2               4 2005-01-01 17:33:36      0.457168    0.167025
3               2 2005-01-01 23:21:00      0.359086    0.851716
4               4 2005-01-02 04:18:36      0.332998    0.830749
5               1 2005-01-02 09:28:12      0.867262    0.855507
6               3 2005-01-02 13:15:36      0.352409    0.023737
7               2 2005-01-02 17:31:12      0.696562    0.846762
8               1 2005-01-02 21:15:36      0.910944    0.096999
9               4 2005-01-03 02:13:12      0.981430    0.152109

I need to calculate the daily average of river height and river flow for each unique station number, giving a result something like this:

   station_number       date      river_height  river_flow
0               1 2005-01-01      0.285         0.782
1               1 2005-01-02      0.233         0.753
2               2 2005-01-01      0.129         0.386
3               2 2005-01-02      0.994         0.386
4               3 2005-01-01      0.457         0.167
5               3 2005-01-02      0.168         0.134
6               4 2005-01-01      0.356         0.321
7               4 2005-01-02      0.086         0.716

Keep in mind that the above numbers are random, and not actually the averages I'm looking for. I need an entry for each day for each station. I hope I have clarified what I need!

I have tried aggregating using groupby such as below:

monthly_flow_data_mean = df.groupby(pd.PeriodIndex(df['date'], freq="M"))['river_flow'].mean()

But this just averages all river_flow measurements per period and ignores the station numbers. I have had trouble finding the combination of groupby and aggregations that gives me what I need.

I tried this as well:

daily_flow_df = df.groupby(pd.PeriodIndex(df['date'], freq="D")).agg({"river_flow": "mean", "river_height": "mean", "station_number": "first"})

But I am pretty sure this doesn't work either: it does not group by station number, it only chooses how to aggregate that column while still averaging the river flow measurements of all stations together.

I could also split the dataframe into 4 classes, do the aggregation per dataframe, and merge the results back together. But I am wondering if there is some smart little groupby trick that achieves this in fewer lines, as it will be useful later in my project(s) as well, where I might have many more classes in the data.

Upvotes: 1

Views: 928

Answers (1)

Scott Boston

Reputation: 153510

You can use either of the following solutions to group by 'station_number' and by day on the 'date' column, using pd.Grouper or dt.normalize:

df.groupby(['station_number', pd.Grouper(key='date', freq='D')]).mean()  

or

df.groupby(['station_number', df['date'].dt.normalize()]).mean() 
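As a rough end-to-end sketch, here are both variants run on the sample rows from the question. The names `value_cols`, `daily_grouper` and `daily_normalize` are made up for this illustration, and selecting the two measurement columns explicitly is a precaution I added so that only those columns are averaged, whatever the pandas version does with the leftover 'date' column:

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "station_number": [1, 1, 4, 2, 4, 1, 3, 2, 1, 4],
    "date": pd.to_datetime([
        "2005-01-01 08:09:00", "2005-01-01 11:28:12", "2005-01-01 17:33:36",
        "2005-01-01 23:21:00", "2005-01-02 04:18:36", "2005-01-02 09:28:12",
        "2005-01-02 13:15:36", "2005-01-02 17:31:12", "2005-01-02 21:15:36",
        "2005-01-03 02:13:12",
    ]),
    "river_height": [0.285233, 0.129994, 0.457168, 0.359086, 0.332998,
                     0.867262, 0.352409, 0.696562, 0.910944, 0.981430],
    "river_flow": [0.782065, 0.386652, 0.167025, 0.851716, 0.830749,
                   0.855507, 0.023737, 0.846762, 0.096999, 0.152109],
})

# Hypothetical helper name: the columns to average
value_cols = ["river_height", "river_flow"]

# Variant 1: group by station and by calendar day via pd.Grouper
daily_grouper = (
    df.groupby(["station_number", pd.Grouper(key="date", freq="D")])[value_cols]
      .mean()
      .reset_index()  # turn the (station_number, date) index back into columns
)

# Variant 2: group by station and by the timestamp truncated to midnight
daily_normalize = (
    df.groupby(["station_number", df["date"].dt.normalize()])[value_cols]
      .mean()
      .reset_index()
)

print(daily_normalize)
```

Note that a station with no readings on a given day (station 3 on 2005-01-01 in this sample) gets no row in `daily_normalize`. If an entry is needed for every station on every day, one option would be to reindex the grouped result against a full station-by-day MultiIndex before calling `reset_index()`.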

Upvotes: 1
