user308827

Reputation: 21971

Daily climatology for pandas dataframe

        id      vi       dates     f_id
0  5532714  0.549501  2015-07-07    ff_22
1  5532715  0.540969  2015-07-08    ff_22
2  5532716  0.531477  2015-07-09    ff_22
3  5532717  0.521029  2016-07-09    ff_22
4  5532718  0.509694  2017-07-11    ff_22

In the dataframe above, I want to compute the climatology, i.e. the daily average across all years, for a dataframe spanning multiple years. So far I have tried this:

df.index = pd.to_datetime(df.dates)
df.groupby([df.index.month, df.index.day])['vi'].transform('mean')

I want the output to not have a multi-index.

Upvotes: 2

Views: 428

Answers (2)

jezrael

Reputation: 862701

I think you need to assign the result to a new column:

df.index = pd.to_datetime(df.dates)
df['new'] = df.groupby([df.index.month, df.index.day])['vi'].transform('mean')
print(df)
                 id        vi       dates   f_id       new
dates                                                     
2015-07-07  5532714  0.549501  2015-07-07  ff_22  0.549501
2015-07-08  5532715  0.540969  2015-07-08  ff_22  0.540969
2015-07-09  5532716  0.531477  2015-07-09  ff_22  0.526253
2016-07-09  5532717  0.521029  2016-07-09  ff_22  0.526253
2017-07-11  5532718  0.509694  2017-07-11  ff_22  0.509694

df.index = pd.to_datetime(df.dates)
df = df.assign(new=df.groupby([df.index.month, df.index.day])['vi'].transform('mean'))
print(df)
                 id        vi       dates   f_id       new
dates                                                     
2015-07-07  5532714  0.549501  2015-07-07  ff_22  0.549501
2015-07-08  5532715  0.540969  2015-07-08  ff_22  0.540969
2015-07-09  5532716  0.531477  2015-07-09  ff_22  0.526253
2016-07-09  5532717  0.521029  2016-07-09  ff_22  0.526253
2017-07-11  5532718  0.509694  2017-07-11  ff_22  0.509694

And then, if necessary, reset_index:

df = df.reset_index(drop=True)
print(df)
        id        vi       dates   f_id       new
0  5532714  0.549501  2015-07-07  ff_22  0.549501
1  5532715  0.540969  2015-07-08  ff_22  0.540969
2  5532716  0.531477  2015-07-09  ff_22  0.526253
3  5532717  0.521029  2016-07-09  ff_22  0.526253
4  5532718  0.509694  2017-07-11  ff_22  0.509694

But it seems better to convert the dates column to datetime instead of setting it as the index; then reset_index is not necessary:

df.dates = pd.to_datetime(df.dates)
df['new'] = df.groupby([df.dates.dt.month, df.dates.dt.day])['vi'].transform('mean')
print(df)

        id        vi      dates   f_id       new
0  5532714  0.549501 2015-07-07  ff_22  0.549501
1  5532715  0.540969 2015-07-08  ff_22  0.540969
2  5532716  0.531477 2015-07-09  ff_22  0.526253
3  5532717  0.521029 2016-07-09  ff_22  0.526253
4  5532718  0.509694 2017-07-11  ff_22  0.509694

If for some reason you cannot convert the column to datetime, it is possible to use:

d = pd.to_datetime(df.dates)
df['new'] = df.groupby([d.dt.month, d.dt.day])['vi'].transform('mean')
print(df)
        id        vi       dates   f_id       new
0  5532714  0.549501  2015-07-07  ff_22  0.549501
1  5532715  0.540969  2015-07-08  ff_22  0.540969
2  5532716  0.531477  2015-07-09  ff_22  0.526253
3  5532717  0.521029  2016-07-09  ff_22  0.526253
4  5532718  0.509694  2017-07-11  ff_22  0.509694

---

Difference between transform and aggregation:

#changed sample data - dates    
print(df)
        id        vi       dates   f_id
0  5532714  0.549501  2015-07-07  ff_22
1  5532715  0.540969  2016-07-07  ff_22
2  5532716  0.531477  2015-07-09  ff_22
3  5532717  0.521029  2016-07-11  ff_22
4  5532718  0.509694  2017-07-11  ff_22

GroupBy.transform is used to create new columns in the original DataFrame - the size of the output is unchanged, so every row in a group gets the same value:

df.dates = pd.to_datetime(df.dates)
df['new'] = df.groupby([df.dates.dt.month, df.dates.dt.day])['vi'].transform('mean')
print(df)
        id        vi      dates   f_id       new
0  5532714  0.549501 2015-07-07  ff_22  0.545235
1  5532715  0.540969 2016-07-07  ff_22  0.545235
2  5532716  0.531477 2015-07-09  ff_22  0.531477
3  5532717  0.521029 2016-07-11  ff_22  0.515362
4  5532718  0.509694 2017-07-11  ff_22  0.515362

But aggregating with GroupBy.mean changes the size of the output, and the other columns are dropped (if you need them, it is necessary to aggregate each column with agg):

df.dates = pd.to_datetime(df.dates)
df1 = (df.groupby([df.dates.dt.month, df.dates.dt.day])['vi'].mean()
         .rename_axis(('months', 'days')).reset_index())
print(df1)
   months  days        vi
0       7     7  0.545235
1       7     9  0.531477
2       7    11  0.515362
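
The parenthetical note about agg can be sketched as follows (a minimal sketch using the changed sample data above; the choice of 'first' for the id and f_id columns is an illustrative assumption, not the only option):

```python
import pandas as pd

# same changed sample data as above
df = pd.DataFrame({
    'id': [5532714, 5532715, 5532716, 5532717, 5532718],
    'vi': [0.549501, 0.540969, 0.531477, 0.521029, 0.509694],
    'dates': pd.to_datetime(['2015-07-07', '2016-07-07', '2015-07-09',
                             '2016-07-11', '2017-07-11']),
    'f_id': ['ff_22'] * 5,
})

# aggregate vi by mean while keeping a representative value of the other columns
df1 = (df.groupby([df.dates.dt.month, df.dates.dt.day])
         .agg({'vi': 'mean', 'id': 'first', 'f_id': 'first'})
         .rename_axis(('months', 'days'))
         .reset_index())
print(df1)
```

This keeps one row per (month, day) group, as in the mean-only aggregation, but carries id and f_id along in the output.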

Upvotes: 2

piRSquared

Reputation: 294288

If I understand you correctly, you want to create a categorical index to group by. I chose to format the dates with '%b, %d' but you can use any month/day combination derived from the options at strftime.org.

Consider the dataframe df consisting of 4 years of random daily data:

import numpy as np
import pandas as pd

tidx = pd.date_range('2010-01-01', '2013-12-31', name='dates')

np.random.seed([3,1415])
df = pd.DataFrame(dict(vi=np.random.rand(tidx.size)), tidx)

I'll choose to make my category labels from 2012 because it is a leap year and will include Feb 29. By creating categories, pandas will know that Jan, 01 comes before Aug, 31.

fmt = '%b, %d'
categories = pd.CategoricalIndex(
    df.index.strftime(fmt),
    pd.date_range('2012-01-01', '2012-12-31').strftime(fmt)
)
df.groupby(categories).mean()

               vi
Jan, 01  0.452812
Jan, 02  0.422205
Jan, 03  0.409369
Jan, 04  0.620173
Jan, 05  0.456044
Jan, 06  0.616172
Jan, 07  0.693760
Jan, 08  0.636427
Jan, 09  0.490981
Jan, 10  0.412720
Jan, 11  0.516009
Jan, 12  0.372129
...           ...
Dec, 20  0.523669
Dec, 21  0.372038
Dec, 22  0.609447
Dec, 23  0.568481
Dec, 24  0.612216
Dec, 25  0.765035
Dec, 26  0.368508
Dec, 27  0.558909
Dec, 28  0.371733
Dec, 29  0.516978
Dec, 30  0.654397
Dec, 31  0.793721

[366 rows x 1 columns]
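
As a quick check of the ordering claim (a minimal sketch on the same random data: without the categorical index, grouping by the plain strings sorts the labels alphabetically, so 'Apr, 01' comes first instead of 'Jan, 01'):

```python
import numpy as np
import pandas as pd

tidx = pd.date_range('2010-01-01', '2013-12-31', name='dates')
np.random.seed([3, 1415])
df = pd.DataFrame(dict(vi=np.random.rand(tidx.size)), tidx)

fmt = '%b, %d'

# grouping by plain strings works, but the result is sorted lexicographically
plain = df.groupby(df.index.strftime(fmt)).mean()
print(plain.index[0])  # 'Apr, 01'

# the categorical index restores calendar order
categories = pd.CategoricalIndex(
    df.index.strftime(fmt),
    pd.date_range('2012-01-01', '2012-12-31').strftime(fmt)
)
cat = df.groupby(categories).mean()
print(cat.index[0])    # 'Jan, 01'
```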

Upvotes: 3
