user308827
user308827

Reputation: 21971

Create climatology from pandas dataframe (append the mean by day-of-year)

import pandas as pd
import pdb, random

dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
data  = [int(1000*random.random()) for i in xrange(len(dates))]
cum_data = pd.Series(data, index=dates)

cum_data.head()
    1950-01-01    310
    1950-01-02    477
    1950-01-03    401
    1950-01-04    896
    1950-01-05     65
...
    1951-01-01    320
...
    1952-01-01    330
...
    1953-01-01    340

# Compute climatology
cum_data.groupby([cum_data.index.month, cum_data.index.day]).mean()

I would like to calculate the climatology for this dataframe i.e. find the mean of all values corresponding to Jan 1st (for the years 1950 to 1953), and then append the mean value to the dataframe for the time period from Jan 1st 1954 to Dec 31st 1960. How do I do that?

The expected output should be the original dataset for the years 1950 to 1953. Jan 1st 1954 should be the average of Jan 1st of 1950, 1951, 1952, 1953. In this case:

    1954-01-01    325
...
    1955-01-01    325
...
...
    1960-01-01    325

So, Jan 1st 1954 will equal Jan 1st 1955 ... Jan 1st 1960. This holds true for all the other days as well.

Upvotes: 1

Views: 478

Answers (1)

Zero
Zero

Reputation: 76927

You could use resample function over a year period AS

In [8]: cum_data.resample('AS', how='mean')
Out[8]:
1950-01-01    502.169863
1951-01-01    503.698630
1952-01-01    503.185792
1953-01-01    504.961644
Freq: AS-JAN, dtype: float64

Store this result into tmp

In [9]: tmp = cum_data.resample('AS', how='mean')

Change the index to desired time range

In [10]: tmp.index = (pd.date_range('1954-01-01', '1957-01-01', freq='AS'))

In [11]: tmp
Out[11]:
1954-01-01    502.169863
1955-01-01    503.698630
1956-01-01    503.185792
1957-01-01    504.961644
Freq: AS-JAN, dtype: float64

Fill in with the daily dates too.

In [12]: tmp = tmp.reindex(pd.date_range('1954-01-01', '1957-12-31', freq='D')).ffill()

In [13]: tmp.head()
Out[13]:
1954-01-01    502.169863
1954-01-02    502.169863
1954-01-03    502.169863
1954-01-04    502.169863
1954-01-05    502.169863
Freq: D, dtype: float64

And, then append to cum_data

In [14]: cum_data.append(tmp)
Out[14]:
1950-01-01    430
1950-01-02    125
1950-01-03    371
1950-01-04    906
1950-01-05    504
...
1957-12-28    504.961644
1957-12-29    504.961644
1957-12-30    504.961644
1957-12-31    504.961644
Length: 2922

Upvotes: 2

Related Questions