Reputation: 21971
        id        vi       dates   f_id
0  5532714  0.549501  2015-07-07  ff_22
1  5532715  0.540969  2015-07-08  ff_22
2  5532716  0.531477  2015-07-09  ff_22
3  5532717  0.521029  2016-07-09  ff_22
4  5532718  0.509694  2017-07-11  ff_22
For the dataframe above, I want to compute a climatology, i.e. the average for each calendar day across multiple years. So far I have tried this:
df.index = pd.to_datetime(df.dates)
df.groupby([df.index.month, df.index.day])['vi'].transform('mean')
I want the output not to have a MultiIndex.
Upvotes: 2
Views: 428
Reputation: 862701
I think you need to assign the result to a new column:
df.index = pd.to_datetime(df.dates)
df['new'] = df.groupby([df.index.month, df.index.day])['vi'].transform('mean')
print (df)
                 id        vi       dates   f_id       new
dates
2015-07-07  5532714  0.549501  2015-07-07  ff_22  0.549501
2015-07-08  5532715  0.540969  2015-07-08  ff_22  0.540969
2015-07-09  5532716  0.531477  2015-07-09  ff_22  0.526253
2016-07-09  5532717  0.521029  2016-07-09  ff_22  0.526253
2017-07-11  5532718  0.509694  2017-07-11  ff_22  0.509694
Alternatively, with DataFrame.assign:
df.index = pd.to_datetime(df.dates)
df = df.assign(new=df.groupby([df.index.month, df.index.day])['vi'].transform('mean'))
print (df)
                 id        vi       dates   f_id       new
dates
2015-07-07  5532714  0.549501  2015-07-07  ff_22  0.549501
2015-07-08  5532715  0.540969  2015-07-08  ff_22  0.540969
2015-07-09  5532716  0.531477  2015-07-09  ff_22  0.526253
2016-07-09  5532717  0.521029  2016-07-09  ff_22  0.526253
2017-07-11  5532718  0.509694  2017-07-11  ff_22  0.509694
Then, if necessary, call reset_index:
df = df.reset_index(drop=True)
print (df)
        id        vi       dates   f_id       new
0  5532714  0.549501  2015-07-07  ff_22  0.549501
1  5532715  0.540969  2015-07-08  ff_22  0.540969
2  5532716  0.531477  2015-07-09  ff_22  0.526253
3  5532717  0.521029  2016-07-09  ff_22  0.526253
4  5532718  0.509694  2017-07-11  ff_22  0.509694
But it seems better to convert the dates column to datetime, so that reset_index is not necessary:
df.dates = pd.to_datetime(df.dates)
df['new'] = df.groupby([df.dates.dt.month, df.dates.dt.day])['vi'].transform('mean')
print (df)
        id        vi      dates   f_id       new
0  5532714  0.549501 2015-07-07  ff_22  0.549501
1  5532715  0.540969 2015-07-08  ff_22  0.540969
2  5532716  0.531477 2015-07-09  ff_22  0.526253
3  5532717  0.521029 2016-07-09  ff_22  0.526253
4  5532718  0.509694 2017-07-11  ff_22  0.509694
If for some reason you cannot convert the column to datetime, it is possible to use:
d = pd.to_datetime(df.dates)
df['new'] = df.groupby([d.dt.month, d.dt.day])['vi'].transform('mean')
print (df)
        id        vi       dates   f_id       new
0  5532714  0.549501  2015-07-07  ff_22  0.549501
1  5532715  0.540969  2015-07-08  ff_22  0.540969
2  5532716  0.531477  2015-07-09  ff_22  0.526253
3  5532717  0.521029  2016-07-09  ff_22  0.526253
4  5532718  0.509694  2017-07-11  ff_22  0.509694
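For anyone who wants to run this end to end, here is a minimal self-contained sketch of the last variant. The frame is rebuilt by hand from the sample in the question, and the names d and new are only illustrative:

```python
import pandas as pd

# Sample data copied from the question; dates are kept as plain strings.
df = pd.DataFrame({
    "id": [5532714, 5532715, 5532716, 5532717, 5532718],
    "vi": [0.549501, 0.540969, 0.531477, 0.521029, 0.509694],
    "dates": ["2015-07-07", "2015-07-08", "2015-07-09", "2016-07-09", "2017-07-11"],
    "f_id": ["ff_22"] * 5,
})

# Parse the strings into a helper Series without touching the original column.
d = pd.to_datetime(df["dates"])

# Group on (month, day) so the same calendar day in different years shares a group.
# transform returns one value per input row, aligned to the original index,
# so the result has no MultiIndex.
df["new"] = df.groupby([d.dt.month, d.dt.day])["vi"].transform("mean")
print(df)
```

Only 07-09 occurs in more than one year in this sample, so rows 2 and 3 are the only ones whose new value differs from vi.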
The difference between transform and aggregation:
#changed sample data - dates
print (df)
        id        vi       dates   f_id
0  5532714  0.549501  2015-07-07  ff_22
1  5532715  0.540969  2016-07-07  ff_22
2  5532716  0.531477  2015-07-09  ff_22
3  5532717  0.521029  2016-07-11  ff_22
4  5532718  0.509694  2017-07-11  ff_22
GroupBy.transform is used to add new columns to the original DataFrame: the size of the output is unchanged, so all rows in the same group get the same value:
df.dates = pd.to_datetime(df.dates)
df['new'] = df.groupby([df.dates.dt.month, df.dates.dt.day])['vi'].transform('mean')
print (df)
        id        vi      dates   f_id       new
0  5532714  0.549501 2015-07-07  ff_22  0.545235
1  5532715  0.540969 2016-07-07  ff_22  0.545235
2  5532716  0.531477 2015-07-09  ff_22  0.531477
3  5532717  0.521029 2016-07-11  ff_22  0.515362
4  5532718  0.509694 2017-07-11  ff_22  0.515362
But if you aggregate with GroupBy.mean, the size of the output changes (one row per group) and the other columns are not in the output (if you need them, you have to aggregate each column with agg):
df.dates = pd.to_datetime(df.dates)
# parentheses are needed so the chained call can continue on the next line
df1 = (df.groupby([df.dates.dt.month, df.dates.dt.day])['vi'].mean()
         .rename_axis(('months', 'days')).reset_index())
print (df1)
   months  days        vi
0       7     7  0.545235
1       7     9  0.531477
2       7    11  0.515362
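To illustrate the remark about agg, here is a rough sketch that keeps the other columns by giving each one its own aggregation. It uses the changed sample data from above. Taking "first" for f_id and adding an n_years count are my assumptions for the example, not something the question asks for:

```python
import pandas as pd

# Changed sample data from this answer, with dates already parsed.
df = pd.DataFrame({
    "id": [5532714, 5532715, 5532716, 5532717, 5532718],
    "vi": [0.549501, 0.540969, 0.531477, 0.521029, 0.509694],
    "dates": pd.to_datetime(
        ["2015-07-07", "2016-07-07", "2015-07-09", "2016-07-11", "2017-07-11"]
    ),
    "f_id": ["ff_22"] * 5,
})

# Name the two keys differently so reset_index does not produce duplicate labels.
months = df["dates"].dt.month.rename("months")
days = df["dates"].dt.day.rename("days")

# Named aggregation: output_column=(input_column, function).
# "first" for f_id and the n_years count are illustrative choices (assumptions).
df1 = (
    df.groupby([months, days])
      .agg(vi=("vi", "mean"), f_id=("f_id", "first"), n_years=("dates", "count"))
      .reset_index()
)
print(df1)
```

Renaming the key Series up front replaces the rename_axis step, and the result is one flat row per calendar day.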
Upvotes: 2
Reputation: 294288
If I understand you correctly, you want to create a categorical index to group by. I chose to format the dates with '%b, %d', but you can use any month/day combination derived from the options at strftime.org.
Consider the dataframe df consisting of 4 years of random daily data:
import numpy as np
import pandas as pd

tidx = pd.date_range('2010-01-01', '2013-12-31', name='dates')
np.random.seed([3,1415])
df = pd.DataFrame(dict(vi=np.random.rand(tidx.size)), tidx)
I'll make my category labels from 2012 because it is a leap year and will include Feb 29. By creating categories, pandas will know that Jan, 01 comes before Aug, 31.
fmt = '%b, %d'
categories = pd.CategoricalIndex(
    df.index.strftime(fmt),
    pd.date_range('2012-01-01', '2012-12-31').strftime(fmt)
)
df.groupby(categories).mean()
vi
Jan, 01 0.452812
Jan, 02 0.422205
Jan, 03 0.409369
Jan, 04 0.620173
Jan, 05 0.456044
Jan, 06 0.616172
Jan, 07 0.693760
Jan, 08 0.636427
Jan, 09 0.490981
Jan, 10 0.412720
Jan, 11 0.516009
Jan, 12 0.372129
... ...
Dec, 20 0.523669
Dec, 21 0.372038
Dec, 22 0.609447
Dec, 23 0.568481
Dec, 24 0.612216
Dec, 25 0.765035
Dec, 26 0.368508
Dec, 27 0.558909
Dec, 28 0.371733
Dec, 29 0.516978
Dec, 30 0.654397
Dec, 31 0.793721
[366 rows x 1 columns]
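One thing the table above hides is how many years feed into each average. The sketch below is a variant of the same idea that also reports a count. It uses pd.Categorical with explicit keywords and passes observed=False, and it draws the random data with default_rng(0) (my choice of seed, so the values will not match the output above). It also assumes an English locale for '%b':

```python
import numpy as np
import pandas as pd

tidx = pd.date_range("2010-01-01", "2013-12-31", name="dates")
rng = np.random.default_rng(0)  # assumption: arbitrary seed, values differ from the answer
df = pd.DataFrame({"vi": rng.random(tidx.size)}, index=tidx)

fmt = "%b, %d"
# 366 labels in calendar order, taken from a leap year so "Feb, 29" is included.
labels = pd.date_range("2012-01-01", "2012-12-31").strftime(fmt)
keys = pd.Categorical(df.index.strftime(fmt), categories=labels, ordered=True)

# observed=False keeps every category in the result, in category order.
clim = df.groupby(keys, observed=False)["vi"].agg(["mean", "count"])

print(clim.loc[["Feb, 28", "Feb, 29", "Mar, 01"]])
```

With 2010 through 2013 as input, Feb, 29 is backed by a single observation while every other day has four, which is worth keeping in mind when reading the climatology.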
Upvotes: 3