Reputation: 11
I am trying to calculate the daily mean of some data that is originally recorded hourly. It looks like this:
Date pm25 Quality_pm25
576 2020-02-25 00:00:00 22.0 1.00
577 2020-02-25 01:00:00 21.0 1.00
578 2020-02-25 02:00:00 27.0 1.00
579 2020-02-25 03:00:00 24.0 1.00
580 2020-02-25 04:00:00 27.0 1.00
581 2020-02-25 05:00:00 27.0 1.00
582 2020-02-25 06:00:00 27.0 1.00
583 2020-02-25 07:00:00 37.0 1.00
584 2020-02-25 08:00:00 41.0 1.00
585 2020-02-25 09:00:00 -9999.0 4.31
586 2020-02-25 10:00:00 -9999.0 4.31
587 2020-02-25 11:00:00 -9999.0 4.31
588 2020-02-25 12:00:00 -9999.0 4.31
589 2020-02-25 13:00:00 -9999.0 4.31
590 2020-02-25 14:00:00 -9999.0 4.31
591 2020-02-25 15:00:00 -9999.0 4.31
592 2020-02-25 16:00:00 -9999.0 4.31
593 2020-02-25 17:00:00 -9999.0 4.31
594 2020-02-25 18:00:00 -9999.0 4.31
595 2020-02-25 19:00:00 -9999.0 4.31
596 2020-02-25 20:00:00 -9999.0 4.31
597 2020-02-25 21:00:00 -9999.0 4.31
598 2020-02-25 22:00:00 -9999.0 4.31
599 2020-02-25 23:00:00 -9999.0 4.31
I've been using this code and it works fine.
import numpy as np
import pandas as pd

df = pd.read_csv('Data/estacion_data_calidadaire_25_20200201_20200229.csv',
                 parse_dates=['Date'])
df = df[['Date', 'pm25', 'Quality_pm25']]
invalid = (df['Quality_pm25'] >= 2.6) | (df['pm25'] == -9999)
valid = df[~invalid]
mean = valid.groupby(valid['Date'].dt.day)['pm25'].mean()
mean = mean.round(0)
which results in:
Date
1 22.0
2 14.0
3 15.0
4 21.0
5 24.0
6 29.0
7 37.0
8 34.0
9 28.0
10 29.0
11 30.0
12 32.0
13 33.0
14 40.0
15 25.0
16 19.0
17 20.0
18 21.0
19 28.0
20 28.0
21 29.0
22 23.0
23 21.0
24 27.0
25 28.0
Now I want to add a condition to that calculation: the daily mean should be NaN if fewer than 18 hours of valid data remain after applying the following filter, which determines which data is invalid:
invalid = ((df['Quality_pm25'] >=2.6) | (df['pm25'] == -9999))
For example, if I were to calculate the mean for 2020-02-25 (data shown at the beginning), since only 9 of the 24 hours have valid data (as shown below), I'd expect the mean to be NaN.
Date pm25 Quality_pm25
576 2020-02-25 00:00:00 22.0 1.00
577 2020-02-25 01:00:00 21.0 1.00
578 2020-02-25 02:00:00 27.0 1.00
579 2020-02-25 03:00:00 24.0 1.00
580 2020-02-25 04:00:00 27.0 1.00
581 2020-02-25 05:00:00 27.0 1.00
582 2020-02-25 06:00:00 27.0 1.00
583 2020-02-25 07:00:00 37.0 1.00
584 2020-02-25 08:00:00 41.0 1.00
Desired result:
Date
25 NaN
How can I do this? I hope it's clear what I want to do.
Upvotes: 0
Views: 97
Reputation: 11
I found a solution to my question like this:
mean = valid.groupby(valid['Date'].dt.day)['pm25_' + i].mean().where(valid.groupby(valid['Date'].dt.day)['pm25_' + i].count().ge(18))
But now, when the average is computed, some values don't appear. I tried to use .mean(skipna=False), but it raises an error:
TypeError: mean() got an unexpected keyword argument 'skipna'
I looked it up and .mean() does have a skipna=None keyword, so I don't know what's wrong.
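The days that don't appear are probably those where every hour was filtered out as invalid, so they never reach the groupby. The error message also suggests that mean() on a groupby object did not accept skipna in the pandas version in use, unlike Series.mean(). Here is a minimal sketch of one way around both, assuming the column names from the question and a small made-up frame (the sample values are hypothetical): mask invalid readings to NaN instead of dropping the rows, so every day stays in the result, and let count() report the number of valid hours, since it already ignores NaN.

```python
import pandas as pd

# Hypothetical sample data: three days of hourly readings.
dates = pd.date_range("2020-02-24", periods=72, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"Date": dates, "pm25": 20.0, "Quality_pm25": 1.0})

# Day 24: all 24 hours valid. Day 25: only the first 9 hours valid.
# Day 26: no valid hours at all.
bad = df["Date"] >= pd.Timestamp("2020-02-25 09:00")
df.loc[bad, "pm25"] = -9999.0
df.loc[bad, "Quality_pm25"] = 4.31

invalid = (df["Quality_pm25"] >= 2.6) | (df["pm25"] == -9999)

# Replace invalid readings with NaN but keep the rows,
# so days with no valid data still show up after grouping.
pm25_valid = df["pm25"].mask(invalid)

grouped = pm25_valid.groupby(df["Date"].dt.day)
n_valid = grouped.count()  # count() skips NaN, i.e. counts valid hours
daily_mean = grouped.mean().where(n_valid >= 18).round(0)

print(daily_mean)
```

With this sample, day 24 keeps its mean, while days 25 and 26 are present in the result with NaN.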
Upvotes: 1
Reputation: 802
You can compute multiple aggregations at once by calling .agg({}) on the object returned by groupby.
Example:
from datetime import datetime as dt
...
invalido = ((df['calidad_pm25'] >=2.6) | (df['P_PM25'] == -9999))
valido = (df[~invalido])
agregaciones = {
"Dia": "count",
"P_PM25": "mean"
}
mean = valido.groupby(valido['Dia'].dt.day).agg(agregaciones)
solo_dias_con_al_menos_18_horas = mean[mean.Dia >= 18]
...
I tried to do my best with the Spanish variable names.
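For reference, a self-contained sketch of the same idea using the question's column names (Date, pm25, Quality_pm25) and made-up data. Two things here are assumptions rather than part of the answer above: it uses named aggregation so the count column gets its own name (n_hours, a hypothetical name), and it uses .where() instead of row filtering so that days with too few hours stay in the result as NaN, which is what the question asked for.

```python
import pandas as pd

# Hypothetical sample data: two days of hourly readings.
dates = pd.date_range("2020-02-24", periods=48, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"Date": dates, "pm25": 30.0, "Quality_pm25": 1.0})

# Make everything from 2020-02-25 09:00 onwards invalid (9 valid hours remain).
bad = df["Date"] >= pd.Timestamp("2020-02-25 09:00")
df.loc[bad, "pm25"] = -9999.0
df.loc[bad, "Quality_pm25"] = 4.31

invalid = (df["Quality_pm25"] >= 2.6) | (df["pm25"] == -9999)
valid = df[~invalid]

# One pass over the groups: number of valid hours and their mean.
summary = valid.groupby(valid["Date"].dt.day).agg(
    n_hours=("pm25", "count"),
    mean_pm25=("pm25", "mean"),
)

# Keep every day, but blank out means based on fewer than 18 valid hours.
summary["mean_pm25"] = summary["mean_pm25"].where(summary["n_hours"] >= 18)

print(summary)
```

Note that a day with no valid rows at all would still be missing here, because those rows are dropped before grouping.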
Upvotes: 0