Wenche Aas
Wenche Aas

Reputation: 21

aggregate irregular data to monthly averages weighing for days in months,using python

I am struggling to average my data into monthly mean taking into account that the data can represent two months (and years), and there are empty periods. Thus a need to weigh the samples accordantly, i.e. a sample with 2 days in one month and 5 in the next should have 2/7 and 5/7 weight, respectively. I have tried to first reindex to daily data and then resample, but problems with empty periods (do not want to fill these) and irregular sampling periods. Surly I can calculate the duration for each periods (end-start), but not sure if that helps because also need to know the split between months.

Here is part of the data:

start               end                 conc
26.12.2019 07:00    03.01.2020 07:00    0.235
03.01.2020 07:00    14.01.2020 07:00    nan
14.01.2020 07:00    20.01.2020 07:00    0.032
21.01.2020 07:00    29.01.2020 07:00    1.601
29.01.2020 07:00    06.02.2020 07:00    0.815
06.02.2020 07:00    10.02.2020 07:00    nan
10.02.2020 07:00    20.02.2020 07:00    3.029
20.02.2020 07:00    28.02.2020 07:00    2.158
28.02.2020 07:00    09.03.2020 07:00    0.784
09.03.2020 07:00    18.03.2020 07:00    0.304

The results should be:

jan.2020    0.829
feb.2020    2.128

Upvotes: 2

Views: 203

Answers (1)

Liam Warfield
Liam Warfield

Reputation: 81

From a software design prospective, it looks like you need to do some data cleaning before trying to interact with the data.

Data Science Cycle

You need to define what you want to do. Do you want to get rid of all the nan values, or treat them as 0? What if an event starts in one month but ends in another? Once you have these questions, you can start thinking about what to code. Below would be the approach I would take (But I don't know exactly what you want to do with the data):

  1. Deal with the NaNs
  2. decide what month each data point is in, and add a column with that information (heres an example of how to get the month from python datatimes).
  3. use start and end date columns to create a new "duration" column
start               end                Duration    month   conc
26.12.2019 07:00    03.01.2020 07:00     x sec      Dec    0.235
03.01.2020 07:00    14.01.2020 07:00     y sec      Jan    nan
14.01.2020 07:00    20.01.2020 07:00     z sec      Jan    0.032
  1. Aggrigate the data by month. Add up all the durations in a month. Add up all the conc data in a month. avg = tot_conc / tot_duration. Then add this value to a new table:
 | month | avg_conc |
 |  Feb. |  2.218.  |

Upvotes: 0

Related Questions