Reputation: 59
I have a dataframe where I have a collection of the following entries: (Date, Volume).
I would like to create a new dataframe column where the Volume column is subtracted with the mean of the monthly volume. I would like what is the way to achieve that in pandas.
In below you can find the setup of the above:
import pandas as pd
import io
import numpy as np
df = pd.read_csv(io.StringIO(csv_file_content))
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
# Get the means for each year/month pair
month_means = df.groupby([df.index.year, df.index.month])['Volume'].mean().round(2)
# I would like to subtract the Volume with the mean of its month.
df['spread_monthly'] = df['Volume'] - month_means[zip(df.index.year, df.index.month)]
Upvotes: 1
Views: 1883
Reputation: 24322
You can also do this by groupby()
and transform()
method:
df['spread_monthly']=df['Volume']-df.groupby([df.index.month,df.index.year])['Volume'].transform('mean').values
Upvotes: 0
Reputation: 5918
Just to add another option with vectorized solution.
We can use groupby with pd.Grouper with freq=M
to get the volume spread per month.
Setting the Date
column to index is a choice.
Toy Example
df = pd.DataFrame({
'Date': pd.date_range('2020.10.10', periods=12, freq='15D'),
'Volume': np.arange(1,13)
})
df
Date Volume
0 2020-10-10 1
1 2020-10-25 2
2 2020-11-09 3
3 2020-11-24 4
4 2020-12-09 5
5 2020-12-24 6
6 2021-01-08 7
7 2021-01-23 8
8 2021-02-07 9
9 2021-02-22 10
10 2021-03-09 11
11 2021-03-24 12
Code
df['spread_monthly'] = df.groupby([pd.Grouper(key='Date', freq='M')]).transform('mean')
df['spread_monthly'] = df.spread_monthly - df.Volume
df
Output
Date Volume spread_monthly
0 2020-10-10 1 0.5
1 2020-10-25 2 -0.5
2 2020-11-09 3 0.5
3 2020-11-24 4 -0.5
4 2020-12-09 5 0.5
5 2020-12-24 6 -0.5
6 2021-01-08 7 0.5
7 2021-01-23 8 -0.5
8 2021-02-07 9 0.5
9 2021-02-22 10 -0.5
10 2021-03-09 11 0.5
11 2021-03-24 12 -0.5
Upvotes: 0
Reputation: 91
It seems to be complaining about the indexing with the grouped month_means and the original (datetime) index of df['Volume']. To avoid problems with the indexing, you can remove the different indices using x.values
for each series.
Do df['spread_monthly'] = df['Volume'].values - month_means[zip(df.index.year, df.index.month)].values
instead.
Upvotes: 2