Reputation: 13
I am new to python and have the following problem:
I have daily measurements that I imported into a pd.series object. The problem is that the data is such that there are always 31 measurements per month, irrespective of whether it is January or February. In case the respective month has less than 31 days, the measurements are set to zero for those days that exceed the last day of that month. However, missing data within a month is also set to zero. The data for the months of April and May looks like this.
1990-04-01 25.870
1990-04-01 26.205
1990-04-01 12.283
1990-04-01 19.630
1990-04-01 19.239
1990-04-01 23.614
1990-04-01 40.891
1990-04-01 41.152
1990-04-01 35.935
1990-04-01 25.682
1990-04-01 21.674
1990-04-01 15.818
1990-04-01 11.413
1990-04-01 16.522
1990-04-01 33.543
1990-04-01 28.727
1990-04-01 18.043
1990-04-01 10.326
1990-04-01 19.159
1990-04-01 21.848
1990-04-01 35.250
1990-04-01 39.152
1990-04-01 31.522
1990-04-01 23.152
1990-04-01 13.250
1990-04-01 20.705
1990-04-01 27.304
1990-04-01 24.478
1990-04-01 33.674
1990-04-01 32.591
1990-04-01 0.000
1990-05-01 40.370
1990-05-01 41.609
1990-05-01 47.478
1990-05-01 40.682
1990-05-01 42.587
1990-05-01 38.826
1990-05-01 35.543
1990-05-01 30.955
1990-05-01 23.543
1990-05-01 7.857
1990-05-01 0.000
1990-05-01 0.000
1990-05-01 0.000
1990-05-01 0.000
1990-05-01 0.000
1990-05-01 0.000
1990-05-01 54.133
1990-05-01 41.114
1990-05-01 44.739
1990-05-01 43.848
1990-05-01 26.739
1990-05-01 21.318
1990-05-01 26.750
1990-05-01 54.864
1990-05-01 33.000
1990-05-01 33.304
1990-05-01 34.304
1990-05-01 20.886
1990-05-01 20.250
1990-05-01 24.804
1990-05-01 28.091
Length: 62
Is there a way to drop those entries that do not belong to the respective months and provide a new time-index for each day? I need to keep the zeros within a month.
Upvotes: 1
Views: 109
Reputation: 375377
First I would replace the 0s with NaN (to symbolise missing data):
s.replace(0, np.nan, inplace=True)
One way is to create a function which acts on each group (month):
def f(s_month):
date = s_month.index[0]
# create the month long range
rng = pd.date_range(date, date + pd.offsets.MonthEnd(1), freq='D')
# throw away results longer than month length
s_month = s_month.iloc[0:len(rng)]
# reindex those remaining
s_month.index = rng
return s_month
Note: this requires that you have a DatetimeIndex, i.e. s.index = pd.to_datetime(s.index)
.
In [11]: s_month = s.loc['1990-04-01']
In [12]: f(s_month)
Out[12]:
1990-04-01 25.870
1990-04-02 26.205
1990-04-03 12.283
1990-04-04 19.630
...
1990-04-28 24.478
1990-04-29 33.674
1990-04-30 32.591
Freq: D, Name: Value, dtype: float64
Use this with a groupby apply:
In [13]: res = s.groupby(s.index).apply(f)
In [14]: res
Out[14]:
1990-04-01 1990-04-01 25.870
1990-04-02 26.205
1990-04-03 12.283
1990-04-04 19.630
...
Correct the MultiIndex:
In [15]: res.index = res.index.droplevel(0)
In [16]: res
Out[16]:
1990-04-01 25.870
1990-04-02 26.205
1990-04-03 12.283
...
Upvotes: 1