MANUEL RODRÍGUEZ

Reputation: 11

Conditional mean

I am trying to calculate the daily mean of some data that is recorded hourly. It looks like this:

    Date                pm25    Quality_pm25
576 2020-02-25 00:00:00 22.0    1.00
577 2020-02-25 01:00:00 21.0    1.00
578 2020-02-25 02:00:00 27.0    1.00
579 2020-02-25 03:00:00 24.0    1.00
580 2020-02-25 04:00:00 27.0    1.00
581 2020-02-25 05:00:00 27.0    1.00
582 2020-02-25 06:00:00 27.0    1.00
583 2020-02-25 07:00:00 37.0    1.00
584 2020-02-25 08:00:00 41.0    1.00
585 2020-02-25 09:00:00 -9999.0 4.31
586 2020-02-25 10:00:00 -9999.0 4.31
587 2020-02-25 11:00:00 -9999.0 4.31
588 2020-02-25 12:00:00 -9999.0 4.31
589 2020-02-25 13:00:00 -9999.0 4.31
590 2020-02-25 14:00:00 -9999.0 4.31
591 2020-02-25 15:00:00 -9999.0 4.31
592 2020-02-25 16:00:00 -9999.0 4.31
593 2020-02-25 17:00:00 -9999.0 4.31
594 2020-02-25 18:00:00 -9999.0 4.31
595 2020-02-25 19:00:00 -9999.0 4.31
596 2020-02-25 20:00:00 -9999.0 4.31
597 2020-02-25 21:00:00 -9999.0 4.31
598 2020-02-25 22:00:00 -9999.0 4.31
599 2020-02-25 23:00:00 -9999.0 4.31

I've been using this code and it works fine.

import numpy as np
import pandas as pd

df = pd.read_csv('Data/estacion_data_calidadaire_25_20200201_20200229.csv',
                parse_dates=['Date'])

df = df[['Date', 'pm25', 'Quality_pm25']]

invalid = (df['Quality_pm25'] >= 2.6) | (df['pm25'] == -9999)

valid = df[~invalid]

mean = valid.groupby(valid['Date'].dt.day)['pm25'].mean()

mean = round(mean, 0)

which results in:

Date
1     22.0
2     14.0
3     15.0
4     21.0
5     24.0
6     29.0
7     37.0
8     34.0
9     28.0
10    29.0
11    30.0
12    32.0
13    33.0
14    40.0
15    25.0
16    19.0
17    20.0
18    21.0
19    28.0
20    28.0
21    29.0
22    23.0
23    21.0
24    27.0
25    28.0

Now I want to make that mean conditional: if a day has fewer than 18 hours of valid data, its mean should be NaN. Valid hours are those that remain after applying the following filter, which determines which data is invalid:

invalid = (df['Quality_pm25'] >= 2.6) | (df['pm25'] == -9999)

For example, for 2020-02-25 (the data shown at the beginning), only 9 of the 24 hours have valid data (shown below), so I'd expect the mean to be NaN.

        Date                pm25    Quality_pm25
    576 2020-02-25 00:00:00 22.0    1.00
    577 2020-02-25 01:00:00 21.0    1.00
    578 2020-02-25 02:00:00 27.0    1.00
    579 2020-02-25 03:00:00 24.0    1.00
    580 2020-02-25 04:00:00 27.0    1.00
    581 2020-02-25 05:00:00 27.0    1.00
    582 2020-02-25 06:00:00 27.0    1.00
    583 2020-02-25 07:00:00 37.0    1.00
    584 2020-02-25 08:00:00 41.0    1.00

Desired result

Date                
25     NaN

How can I do this? I hope it's clear what I want to do.

Upvotes: 0

Views: 97

Answers (2)

MANUEL RODRÍGUEZ

Reputation: 11

I found a solution to my question:

grouped = valid.groupby(valid['Date'].dt.day)['pm25']
mean = grouped.mean().where(grouped.count().ge(18))

But now, when the average is computed, some values don't appear in the result. I tried to use .mean(skipna=False), but it raises an error:

TypeError: mean() got an unexpected keyword argument 'skipna'

I looked it up, and .mean() does have a skipna=None keyword, so I don't know what's wrong.
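A possible explanation, offered as a sketch rather than a confirmed diagnosis: the TypeError suggests that the mean() of a groupby object has a different signature from Series.mean() in the pandas version used here, and does not take skipna. Separately, a day whose rows are all invalid is removed entirely by df[~invalid], so it never forms a group and cannot appear in the result. One way around both is to mask invalid readings as NaN instead of dropping the rows, so every day keeps its group. The toy data and the helper name daily_mean_min_hours are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd


def daily_mean_min_hours(df, min_hours=18):
    """Daily pm25 mean; NaN for days with fewer than min_hours valid readings."""
    invalid = (df['Quality_pm25'] >= 2.6) | (df['pm25'] == -9999)
    # Mask invalid readings instead of dropping rows, so every day keeps a group.
    pm25 = df['pm25'].mask(invalid)
    grouped = pm25.groupby(df['Date'].dt.day)
    # count() ignores NaN, so it equals the number of valid hours per day.
    return grouped.mean().where(grouped.count() >= min_hours)


# Hypothetical toy data: day 1 has 24 valid hours, day 2 has 9, day 3 has none.
dates = pd.Timestamp('2020-02-01') + pd.to_timedelta(np.arange(72), unit='h')
pm25 = np.r_[np.full(24, 20.0), np.full(9, 30.0), np.full(39, -9999.0)]
quality = np.where(pm25 == -9999.0, 4.31, 1.0)
toy = pd.DataFrame({'Date': dates, 'pm25': pm25, 'Quality_pm25': quality})

result = daily_mean_min_hours(toy)
print(result)
```

With this approach, days 2 and 3 stay in the index with a NaN mean instead of disappearing.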

Upvotes: 1

Victor Maricato

Reputation: 802

You can compute several aggregations at once by calling .agg({}) on the object returned by groupby.

Example:

from datetime import datetime as dt
...
invalido = (df['calidad_pm25'] >= 2.6) | (df['P_PM25'] == -9999)

valido = (df[~invalido])

agregaciones = {
    "Dia": "count",
    "P_PM25": "mean"
}

mean = valido.groupby(valido['Dia'].dt.day).agg(agregaciones)

solo_dias_con_al_menos_18_horas = mean[mean.Dia >= 18]
...

I tried to do my best with the Spanish variable names.
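For reference, here is a self-contained sketch of the same idea using the column names from the question (Date, pm25, Quality_pm25) and pandas named aggregation. The toy data and the output column names hours and pm25_mean are assumptions for illustration. Note that filtering this way drops days below the threshold from the result rather than marking them as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: day 1 has 24 valid hours, day 2 has only 9.
dates = pd.Timestamp('2020-02-01') + pd.to_timedelta(np.arange(48), unit='h')
pm25 = np.r_[np.full(24, 20.0), np.full(9, 30.0), np.full(15, -9999.0)]
quality = np.where(pm25 == -9999.0, 4.31, 1.0)
df = pd.DataFrame({'Date': dates, 'pm25': pm25, 'Quality_pm25': quality})

invalid = (df['Quality_pm25'] >= 2.6) | (df['pm25'] == -9999)
valid = df[~invalid]

# One pass over the groups: number of valid hours and their mean.
daily = valid.groupby(valid['Date'].dt.day).agg(
    hours=('pm25', 'count'),
    pm25_mean=('pm25', 'mean'),
)

# Keep only the days with at least 18 valid hours.
enough = daily[daily['hours'] >= 18]
print(enough)
```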

Upvotes: 0
